[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Jun 28 22:50:00 CEST 2006


LDC2006S35
CSLU: Multilanguage Telephone Speech Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35>

LDC2006S31
NIST 2003 Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>

LDC2006T12
Spanish Gigaword First Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12>

The Linguistic Data Consortium (LDC) would like to announce the
availability of three new publications.

------------------------------------------------------------------------


(1) The CSLU: Multilanguage Telephone Speech Version 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35>
corpus consists of telephone speech from eleven languages: English,
Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish,
Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances
(e.g., days of the week) as well as fluent continuous speech. The current
release includes recorded utterances from about 2052 speakers, for a
total of about 38.5 hours of speech. Time-aligned phonetic
transcriptions for 619 of the utterances are also included. For the
data collection, the sampling rate was 8 kHz and the files were stored in
16-bit linear format on a UNIX file system. Each utterance was recorded
as a separate file.
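
As a rough illustration of what these figures imply, the sketch below
estimates the duration of one utterance file from its size and the
average amount of speech per speaker. It assumes headerless,
single-channel 16-bit linear PCM at 8 kHz; the actual on-disk file
layout is not described in this announcement, and the function names
are hypothetical.

```python
def raw_pcm_duration_seconds(num_bytes, sample_rate_hz=8000,
                             bytes_per_sample=2, channels=1):
    """Duration of headerless linear PCM audio, given its size in bytes.

    Assumption: no file header and a single channel; adjust if the
    real files carry a header.
    """
    frame_size = bytes_per_sample * channels
    if num_bytes % frame_size != 0:
        raise ValueError("byte count is not a whole number of samples")
    return num_bytes / (sample_rate_hz * frame_size)


def average_seconds_per_speaker(total_hours, num_speakers):
    """Mean amount of speech per speaker, in seconds."""
    return total_hours * 3600 / num_speakers


if __name__ == "__main__":
    # 16,000 bytes of 16-bit samples at 8 kHz is one second of audio.
    print(raw_pcm_duration_seconds(16000))
    # Using the approximate totals quoted above (38.5 hours, 2052 speakers).
    print(round(average_seconds_per_speaker(38.5, 2052), 1))
```

Since both totals are quoted as approximate, the per-speaker figure
(a little over a minute) should be read as an estimate only.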


(2) The goal of the NIST Language Recognition Evaluation (LRE) is to
establish the baseline of current performance capability for language
recognition of conversational telephone speech and to lay the groundwork
for further research efforts in the field. The series had its first
evaluation in 1996. The 2003 NIST Language Recognition Evaluation
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>
(LRE-03) was part of this ongoing series of evaluations of language
recognition technology. The task evaluated was the detection of a given
target language. Given a test segment of speech, a target language was
assigned as a test hypothesis, and the task was to determine whether
this test hypothesis was true or false.
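
The detection task described above can be sketched as a set of trials,
each pairing a test segment with a hypothesized target language and
asking for a true/false decision. The sketch below is only an
illustration: the Trial structure, the score threshold, and the choice
to summarize results as miss and false-alarm rates are assumptions, not
the official NIST scoring procedure.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    segment_id: str        # hypothetical identifier for a test segment
    target_language: str   # language named in the test hypothesis
    true_language: str     # reference answer, known only to the scorer
    score: float           # system score; higher means "more likely target"


def decide(trial, threshold=0.0):
    """Return True if the system accepts the hypothesis for this trial."""
    return trial.score >= threshold


def error_rates(trials, threshold=0.0):
    """Return (miss rate, false-alarm rate) over a list of trials."""
    misses = false_alarms = targets = nontargets = 0
    for t in trials:
        accepted = decide(t, threshold)
        if t.target_language == t.true_language:
            targets += 1
            misses += not accepted       # target trial rejected
        else:
            nontargets += 1
            false_alarms += accepted     # non-target trial accepted
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / nontargets if nontargets else 0.0
    return p_miss, p_fa


if __name__ == "__main__":
    trials = [
        Trial("a", "Spanish", "Spanish", 1.2),
        Trial("b", "Spanish", "Spanish", -0.5),
        Trial("c", "Spanish", "Tamil", 0.3),
        Trial("d", "Spanish", "Hindi", -2.0),
    ]
    print(error_rates(trials))
```

Moving the threshold trades one error type against the other, which is
why detection results are usually reported as a pair of rates rather
than a single accuracy figure.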

Each speech file is one side of a "4 wire" telephone conversation
represented as 8-bit, 8 kHz mu-law data. There are 7990 speech files in
SPHERE (.sph) format for a total of around six hours of speech. The
speech data was compiled from the LDC's CALLFRIEND, CALLHOME, and
SWITCHBOARD-2 corpora.
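
For readers unfamiliar with mu-law audio, the sketch below expands
8-bit mu-law bytes to 16-bit linear samples using the standard G.711
expansion. It assumes the sample bytes have already been extracted from
the file; parsing the SPHERE header is not shown, and the function
names are hypothetical.

```python
MULAW_BIAS = 0x84  # bias added during G.711 mu-law encoding


def mulaw_byte_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed linear sample."""
    u = ~byte & 0xFF            # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(data):
    """Decode a bytes object of mu-law samples to a list of ints."""
    return [mulaw_byte_to_linear(b) for b in data]


def mulaw_duration_seconds(num_bytes, sample_rate_hz=8000):
    """One byte per sample, so duration is bytes / sample rate."""
    return num_bytes / sample_rate_hz


if __name__ == "__main__":
    # 0xFF is mu-law silence; 0x80 and 0x00 are the positive and
    # negative extremes of the code range.
    print(decode_mulaw(bytes([0xFF, 0x80, 0x00])))
    print(mulaw_duration_seconds(24000))
```

At 8 kHz and one byte per sample, each second of speech occupies 8000
bytes of sample data.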


(3) The Spanish Gigaword First Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12>
is a comprehensive archive of newswire text data that has been acquired
over several years by the Linguistic Data Consortium; some of the data
included has been released previously in other LDC corpora.

The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as follows:

* Agence France-Presse, Spanish Service, May 1994 - Dec 2005
* Associated Press Worldstream, Spanish, Nov 1993 - Dec 2005
* Xinhua News Agency, Spanish Service, Sep 2001 - Dec 2005


------------------------------------------------------------------------


If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call
+1 215 573 1275.



--------------------------------------------------------------------

Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu
