[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Mar 1 17:21:02 CET 2010


/New Publications: /

*- LDC2010S01 - Fisher Spanish Speech -*

*- LDC2010T04 - Fisher Spanish - Transcripts -*

/Other news:/

*- 65,000th LDC Corpus Distributed! -*

*- 2010 Publications Pipeline -*

------------------------------------------------------------------------

*New Publications*

(1) Fisher Spanish Speech <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01> was developed by LDC and consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers. Full orthographic transcripts of these audio files are available in Fisher Spanish - Transcripts (LDC2010T04) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04>.

The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Under the Fisher protocol, a very large number of participants each make a few short calls, speaking to other participants, whom they typically do not know, about assigned topics. This maximizes inter-speaker variation and vocabulary breadth, although it also increases formality. Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher is unique in being platform-driven rather than participant-driven. Participants who wish to initiate a call may do so; however, the collection platform initiates the majority of calls. Participants need only answer their phones at the times they specified when registering for the study.

To encourage a broad range of vocabulary, Fisher participants are asked to speak on an assigned topic. The topic is selected at random from a list, changes every 24 hours, and is assigned to all subjects paired on that day. Some topics are inherited or refined from previous Switchboard studies while others were developed specifically for the Fisher protocol.

In collecting data for this corpus, attempts were made to provide a representative distribution of subjects across a variety of demographic categories including: gender, age, dialect region, and education level. Native speakers of Caribbean Spanish and non-Caribbean Spanish were recruited from within the continental United States and Puerto Rico.

The speech recordings consist of 819 telephone conversations of 10 to 12 minutes in duration. They are provided as digital audio files in NIST SPHERE format (1024-byte ASCII file headers). The conversations were recorded as 2-channel mu-law sample data with 8000 samples per second (as captured from the public telephone network).
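For readers who have not worked with SPHERE files before, the following is a minimal Python sketch of reading the ASCII header described above. It assumes the usual SPHERE layout (a "NIST_1A" line, a header-size line, then "name type value" lines ending with "end_head"). The function name parse_sphere_header is hypothetical, and the sample header values are made up for illustration; they are not copied from the corpus.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file.

    `raw` should hold at least the header bytes (1024 for this corpus).
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # "-sN": string of N characters
            fields[name] = value
    return fields


# Illustrative header only (values assumed, padded to 1024 bytes).
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024)

info = parse_sphere_header(sample)
print(info["channel_count"], info["sample_rate"], info["sample_coding"])
```

With a real file, the same function could be applied to the first 1024 bytes read in binary mode; the sample data begins immediately after the header.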


(2) Fisher Spanish - Transcripts <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T04> was developed by LDC and contains full orthographic transcripts of the telephone speech in Fisher Spanish Speech (LDC2010S01) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01>. Transcripts cover roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.

The transcript files are in plain-text, tab-delimited format (tdf) with UTF-8 character encoding. They were created with the LDC-developed transcription tool "XTrans" <http://www.ldc.upenn.edu/tools/XTrans/>, which allowed for improved handling of multi-channel audio and overlapping speakers. XTrans is available from LDC.
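As a rough illustration of working with tab-delimited, UTF-8 transcript files, here is a small Python sketch. The column names, the sample rows, and the treatment of lines starting with ";;" as comments are all assumptions made for this example; the release documentation is the authority on the actual tdf layout.

```python
def read_tdf(data: bytes, columns):
    """Decode UTF-8 bytes and return one dict per tab-delimited row.

    `columns` is supplied by the caller; the names used below are
    hypothetical and not taken from the corpus documentation.
    """
    rows = []
    for line in data.decode("utf-8").splitlines():
        if not line.strip() or line.startswith(";;"):
            continue  # assumed convention: blank and ';;' lines carry no segment
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


# Hypothetical column names and made-up rows, for illustration only.
COLUMNS = ("file", "channel", "start", "end", "speaker", "transcript")
sample = (
    ";; illustrative data, not from the corpus\n"
    "call_001.sph\t0\t1.25\t3.80\tA\thola, ¿cómo estás?\n"
    "call_001.sph\t1\t3.10\t4.95\tB\tmuy bien, gracias\n"
).encode("utf-8")

rows = read_tdf(sample, COLUMNS)
for row in rows:
    print(row["channel"], row["start"], row["end"], row["transcript"])
```

Keeping the channel and the start and end times on every row is what lets overlapping speech from the two sides of a call be represented without conflict.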

Transcribers followed LDC's Transcription Guidelines (NQTR), which are included with the documentation for this release.

Fisher Spanish Speech (LDC2010S01) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S01> provides the digital audio used as the basis for the transcriptions in this corpus, in the form of 2-channel mu-law sample data with 8000 samples per second (as captured from the public telephone network), for 819 telephone conversations of 10 to 12 minutes in duration. The audio files are in NIST SPHERE format (1024-byte ASCII file headers).
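To make the audio format concrete, the sketch below decodes 8-bit mu-law bytes to 16-bit linear values using the standard G.711 expansion and separates the two channels. It assumes that the two channels are interleaved sample by sample in the data following the header, which should be checked against the file headers; the function names are hypothetical.

```python
BIAS = 0x84  # 132, the bias used by G.711 mu-law


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                      # mu-law codes are stored complemented
    magnitude = (((u & 0x0F) << 3) + BIAS) << ((u >> 4) & 0x07)
    magnitude -= BIAS
    return -magnitude if u & 0x80 else magnitude


def split_channels(payload: bytes):
    """Split interleaved 2-channel mu-law data into two lists of samples.

    Assumes sample-by-sample interleaving (channel 1, channel 2, ...).
    """
    first = [ulaw_to_linear(b) for b in payload[0::2]]
    second = [ulaw_to_linear(b) for b in payload[1::2]]
    return first, second


# One byte per sample, 2 channels, 8000 samples per second:
BYTES_PER_SECOND = 2 * 8000
# so a 10-minute conversation holds 600 * 16000 bytes of sample data.
ten_minute_bytes = 600 * BYTES_PER_SECOND

first, second = split_channels(bytes([0x80, 0x00, 0xFF, 0x7F]))
print(first, second, ten_minute_bytes)
```

Each side of a call can then be processed separately, which is convenient when matching audio to per-channel transcript segments.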


*65,000th LDC Corpus Distributed!*

LDC has recently reached another milestone. Two years after having distributed our 50,000th corpus, we have just distributed our 65,000th! To help us celebrate, we took the names of all the organizations that had licensed data on the day we distributed our 65,000th corpus and tossed them into a Phillies baseball cap.

We then randomly drew a name, and the winner is ... Swarthmore College and Universidad Carlos III de Madrid! That's not a typo; we have two lucky winners! We are celebrating our 65,000th distribution by awarding a benefit of US$2000 each to both Swarthmore College and Universidad Carlos III de Madrid. The benefit can be used towards membership or data licensing fees at any time this year.

Swarthmore College and Universidad Carlos III de Madrid join our other recipients of landmark corpora distributions:

* Helsinki University of Technology, Adaptive Informatics Research Centre (AIRC) - licensed our 50,000th distribution in January 2008.

* Instituto de Engenharia de Sistemas e Computadores (INESC) - licensed our 40,000th distribution in November 2006.

* University of Hawai'i, Manoa, Language Analysis and Experimentation Laboratories - licensed our 15,000th distribution in April 2002.

We would like to thank both members and non-members for helping the LDC reach this landmark distribution. The unceasing demand for LDC data from over 2800 organizations supports our mission to develop and share resources for research in human language technologies.

About our winners:

Swarthmore College ~ The Department of Computer Science offers courses that emphasize the fundamental concepts of computer science, treating today's languages and systems as current examples of the underlying concepts. By educating students to think conceptually, we are preparing them to adapt to developments in this dynamic field.

Universidad Carlos III de Madrid ~ The Multimedia Processing Group aims to make a significant research contribution to the field of multimedia processing, especially focusing on combining signal analysis tools with emerging machine learning methods. Projects include automatic multimedia indexing, automatic speech recognition, and last-generation video coding.

*2010 Publications Pipeline*

For Membership Year 2010 (MY2010), we anticipate releasing a varied selection of publications. Many publications are still in development, but here is a glimpse of what is in the pipeline for MY2010. Please note that this list is tentative and subject to modifications. Our planned publications for the coming months include:

/Arabic Treebank: Part 3 v 3.2/ ~ a revision of Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) (LDC2005T20). The full Arabic Treebank: Part 3 has been revised according to the new Arabic Treebank annotation guidelines. The Arabic Treebank project consists of two distinct phases: (a) Part-of-Speech (POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss, and (b) Arabic Treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc. Arabic Treebank: Part 3 v 3.2 consists of 599 newswire stories from An Nahar.

/Chinese Treebank 7.0/ ~ this release encompasses 2,400 text files containing 45,000 sentences, 1.1 million words and 1.65 million hanzi (Chinese characters). The data is provided in two encodings, GBK and UTF-8, and the annotation has Penn Treebank-style labeled brackets.

/Chinese Web 5-gram Version 1/ ~ contains n-grams (unigrams to five-grams) and their observed counts in 880 billion tokens of Chinese web data collected in March 2008. All text was converted to UTF-8. A simple segmenter using the same algorithm used to generate the data is included. The set contains 3.9 billion n-grams total.
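To illustrate what "n-grams and their observed counts" means in practice, here is a toy Python sketch that counts all n-grams from unigrams to five-grams in a short token sequence. The tokens are a made-up, already segmented example; this is not the segmenter or the counting pipeline used to build the corpus.

```python
from collections import Counter


def count_ngrams(tokens, max_order=5):
    """Count every n-gram of order 1..max_order in a token sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up, pre-segmented tokens for illustration only.
tokens = ["我", "喜欢", "喝", "茶", "我", "喜欢", "喝", "水"]
counts = count_ngrams(tokens)

print(counts[("我",)])
print(counts[("我", "喜欢", "喝")])
print(sum(counts.values()))
```

At web scale the counts would not fit in a single in-memory table like this, but the mapping from n-gram to count is the same idea.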

/NPS Chat Corpus Version 1.0/ ~ consists of 10,567 posts gathered from age-specific chat rooms. Each file is a recording transcript from one of these chat rooms for a short period on a particular day. In order to comply with the chat services' terms of service, the posts have been privacy-masked. Each post is annotated with a chat dialog-act tag, and individual tokens within each post are annotated with part-of-speech tags.

/WTIMIT/ ~ is a mobile wideband (i.e., 50 Hz to 7 kHz) telephone adjunct to TIMIT (LDC93S1). WTIMIT has been derived as follows: the original TIMIT speech files at 16 kHz sampling rate were concatenated into 11 signal chunks, each preceded by a 4-second calibration tone. These speech chunks were transmitted via two prepared Nokia 6220 mobile phones over T-Mobile's 3G wideband mobile network in The Hague, The Netherlands, employing the Adaptive Multirate Wideband (AMR-WB) speech codec. After data acquisition and deconcatenation by maximizing the normalized cross-correlation with the original speech files, a database was obtained that is time aligned with the original TIMIT data with good precision. Accordingly, all TIMIT label files can still be used. WTIMIT is suitable for research on speech quality and intelligibility, and investigations on possible wideband upgrades of network-sided IVR systems with retrained or bandwidth-extended acoustic models for automatic speech recognition. WTIMIT will be presented at LREC 2010.
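The alignment step mentioned for WTIMIT, finding where an original file sits inside a transmitted chunk by maximizing normalized cross-correlation, can be sketched in Python as follows. This is a brute-force illustration on toy numbers under our own assumptions, not the procedure used by the WTIMIT authors; the function names are hypothetical.

```python
import math


def normalized_xcorr(a, b):
    """Normalized cross-correlation (Pearson correlation) of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den else 0.0


def best_offset(recording, reference):
    """Return the offset in `recording` where `reference` correlates best."""
    n = len(reference)
    scores = [
        normalized_xcorr(recording[k:k + n], reference)
        for k in range(len(recording) - n + 1)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: the "transmitted" copy is rescaled and shifted in level,
# which normalized cross-correlation is insensitive to.
reference = [0, 3, -2, 5, 1]
recording = [0, 0, 0] + [2 * x + 1 for x in reference] + [0, 0]

offset = best_offset(recording, reference)
print(offset)
```

Because the score is normalized, gain changes introduced by the transmission path do not move the peak, which is why the original label files can be reused once the offsets are known.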

2010 Subscription Members are automatically sent all MY2010 data as it is released. 2010 Standard Members are entitled to request 16 corpora for free from MY2010. Non-members may license most data for research use only.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
