[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Dec 12 20:02:00 CET 2005

** New LDC Online Membership! **

** CSLU: 22 Languages Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S26> **

** Chinese <-> English Name Entity Lists (v1.0)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34> **

** The West Point Company G3 American English Speech Data Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S30> **

The Linguistic Data Consortium (LDC) would like to announce a new
membership option, the LDC Online Membership, and provide information
regarding our new publications.


*LDC Online Membership*

The Linguistic Data Consortium is pleased to announce the LDC Online
Membership, which is now available for the 2006 Membership year. LDC
Online contains a continuously growing, indexed collection of Arabic,
Chinese and English newswire text, millions of words of English
telephone speech from the Switchboard and Fisher collections and the
American English Spoken Lexicon, as well as the full text of the Brown
corpus. With LDC Online, users can search textual data and play audio
extracts for transcribed utterances on standard web browsers. LDC will
continue to add new material to LDC Online, including Spanish, Arabic,
and Chinese conversational telephone data in 2006.

The LDC Online Membership is a reduced-cost alternative that provides
interactive access to a growing subset of LDC data for users who do not
need linguistic data on media. Current LDC members already have access
to all LDC Online resources. The LDC Online Membership is available to
non-profit and U.S. government organizations for $1,000 (USD) per
calendar year (January to December). The obligations and data usage
restrictions of the LDC Online Membership are contained in the LDC
Online Membership Agreement.

We invite you to try LDC Online if you have not already done so. Please
go to http://online.ldc.upenn.edu for a free, limited demonstration and
to sign up for a non-member LDC Online account. To become an LDC Online
member or to request additional information, contact the LDC Membership
Department at ldc at ldc.upenn.edu.

We hope that the LDC Online Membership will enhance your linguistic
research and your association with the LDC.

*New Publications*

(1) The CSLU: 22 Languages Corpus
was produced by the Center for Spoken Language Understanding at Oregon
Health & Science University. The corpus consists of telephone speech
from the following languages: Arabic, Cantonese, Czech, Farsi, German,
Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish,
Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and
English. The corpus contains fixed vocabulary utterances (e.g. days of
the week) as well as fluent continuous speech. Each of the 50191
utterances is verified by a native speaker to determine if the caller
followed instructions when answering the prompts. For this release,
approximately 19758 utterances have corresponding orthographic


(2) Chinese <-> English Name Entity Lists (v1.0)
are compiled from Xinhua News Agency articles. This release consists of
9 pairs of bi-directional lists in the following categories: Person
Names, Place Names, Organization Names, Industry Names, Press Names,
Other Names, and Who is Who Names. The English->Chinese version of each
pair was created by reversing the Chinese->English list; both lists are
sorted with the Unix sort utility.
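
The reversal described above can be sketched in a few lines of Python.
This is only an illustration, not the procedure LDC used: the file
layout (one "source<TAB>target" pair per line) and the function name
reverse_pairs are assumptions, since the announcement does not describe
the list format.

```python
def reverse_pairs(lines, sep="\t"):
    """Swap the two fields of each 'source<sep>target' line and sort.

    Assumption: one pair per line, fields separated by `sep`.
    """
    reversed_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        source, target = line.split(sep, 1)
        reversed_lines.append(f"{target}{sep}{source}")
    # sorted() orders by Unicode code point, which may differ from a
    # locale-aware Unix sort.
    return sorted(reversed_lines)


if __name__ == "__main__":
    chinese_to_english = ["北京\tBeijing", "上海\tShanghai"]
    for entry in reverse_pairs(chinese_to_english):
        print(entry)
```

Note that Python's sorted() compares by code point, so the order can
differ from Unix sort output when a locale-specific collation is in
effect.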


(3) The West Point Company G3 American English Speech Data Corpus
was produced by the Center for Technology Enhanced Language Learning,
part of the U.S. Military Academy's Department of Foreign Languages.
During the 2000-2001 academic year, cadets, staff and faculty members at
the United States Military Academy volunteered to participate in a
speech data collection project for American English. The goal of the
project was to amass recordings from no fewer than one hundred adult
speakers, fifty males and fifty females, to form a substantial corpus of
high-quality read speech.

The 185 sentences comprising the data collection script were written to
elicit examples of all or nearly all of the possible syllables used in
spoken American English. The G3 Corpus audio data comes from 53 female
and 56 male volunteers, each of whom recorded approximately 104
utterances. The recordings have 16-bit resolution and a sampling rate of
22,050 samples per second. Recordings were made using headset
microphones (Shure M10) with preamplifiers attached to the line input
jack of desktop computers. The total amount of speech is about 15 hours.
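
As a rough guide to storage needs, the figures above can be turned into
an approximate size for the audio. The sketch below assumes
uncompressed, single-channel PCM, which the announcement does not
state; the function name pcm_bytes is illustrative only.

```python
def pcm_bytes(hours, sample_rate=22050, bits_per_sample=16, channels=1):
    """Approximate size in bytes of uncompressed PCM audio.

    Assumptions: mono recording and no file header or compression.
    """
    bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
    return int(hours * 3600 * bytes_per_second)


if __name__ == "__main__":
    size = pcm_bytes(15)
    print(f"{size} bytes, about {size / 1e9:.2f} GB")
```

Under these assumptions, 15 hours of speech comes to roughly 2.4 GB.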


If you need further information, or would like to inquire about
membership to the LDC, please email ldc at ldc.upenn.edu or call +1 215 573


Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax: (215) 573-2175
Suite 810                           ldc at ldc.upenn.edu
Philadelphia, PA 19104              http://www.ldc.upenn.edu

