[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Nov 28 16:26:24 CET 2007


- Free Google Data (Web 1T 5-gram) Available <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13> -

LDC2007T40 - Arabic Gigaword Third Edition <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40> -

LDC2007S18 - CSLU: Kids' Speech Version 1.1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S18> -

LDC2007T20 - GALE Phase 1 Distillation Training <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T20> -

The Linguistic Data Consortium (LDC) is pleased to announce the availability of free Web 1T 5-gram data as well as the release of three new publications.

------------------------------------------------------------------------

*Free Google Data (Web 1T 5-gram) Available*

We are pleased to announce that Google Inc. is once again providing financial support for the distribution of its Web 1T 5-gram (LDC2006T13) corpus to universities. As a result, LDC will make the corpus available at no charge to 100 non-member universities requesting a copy. Shipping and handling fees are also being covered by Google. We appreciate Google's continued generosity and its interest in supporting language research.

To obtain a free copy, universities will need to sign and submit a copy of the User License Agreement for Web 1T 5-gram Version 1 <http://www.ldc.upenn.edu/Catalog/nonmem_agree/Web_1T_5gram_V1_User_Agreement.html>. The signed agreement can be faxed to +1 215 573 2175 or scanned and emailed to ldc at ldc.upenn.edu. Complete contact details, including shipping address, phone number, and email address, are also required.

*New Publications*

(1) Arabic Gigaword Third Edition <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T40> is a comprehensive archive of newswire text data acquired from Arabic news sources by the LDC at the University of Pennsylvania. Arabic Gigaword Third Edition includes all of the content of Arabic Gigaword Second Edition (LDC2006T02) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02> as well as new data collected after the publication of that edition. Also, an archive from a new newswire source -- Assabah -- has been included in the third edition.

The six distinct sources of Arabic newswire represented in the third edition are:

* Agence France Presse (afp_arb)

* Assabah (asb_arb)

* Al Hayat (hyt_arb)

* An Nahar (nhr_arb)

* Ummah Press (umh_arb)

* Xinhua News Agency (xin_arb)

The seven-character codes in the parentheses above consist of a three-character source name ID and the three-character language code ("arb"), separated by an underscore ("_") character.
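The naming convention described above can be sketched in a few lines of Python. This is only an illustration: the helper `split_source_code` and the `SOURCES` mapping are hypothetical names introduced here, not part of the LDC release, and the pattern assumes lowercase ASCII letters as in the six codes listed.

```python
import re

# Source codes and names copied from the list above.
SOURCES = {
    "afp_arb": "Agence France Presse",
    "asb_arb": "Assabah",
    "hyt_arb": "Al Hayat",
    "nhr_arb": "An Nahar",
    "umh_arb": "Ummah Press",
    "xin_arb": "Xinhua News Agency",
}

# Assumption: three lowercase letters, an underscore, three lowercase letters.
CODE_PATTERN = re.compile(r"([a-z]{3})_([a-z]{3})")


def split_source_code(code):
    """Split a code such as 'afp_arb' into (source_id, language_code)."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid source code: {code!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for code, name in SOURCES.items():
        source_id, language = split_source_code(code)
        print(f"{code}: source={source_id} language={language} ({name})")
```

A code that does not fit the seven-character pattern raises `ValueError` rather than being silently truncated.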

The epochs and document counts for the data in the third edition are set forth below:

Newly Added Data

Source                 Date Span          Document Count
Agence France Presse   2005.01 - 2006.12          137815
Assabah News Agency    2004.09 - 2006.12           15410  (new source)
Al Hayat News Agency   2005.01 - 2006.1             8799  (no data for 2004)
An Nahar News Agency   2005.01 - 2006.12          104950  (no data for 2004)
Xinhua News Agency     2005.01 - 2006.12          135472

This release contains 547 files, totaling approximately 1.8GB in compressed form (6,673 MB uncompressed) and 1,994,735 K-words.
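For readers who want to work with the newly added document counts, the figures in the table above can be totaled as follows. This is a minimal sketch: the counts are copied from the table, while the names `NEW_DOCS` and `total_new_documents` are introduced here only for illustration.

```python
# Newly added document counts, copied from the table above.
NEW_DOCS = {
    "Agence France Presse": 137815,
    "Assabah News Agency": 15410,
    "Al Hayat News Agency": 8799,
    "An Nahar News Agency": 104950,
    "Xinhua News Agency": 135472,
}


def total_new_documents(counts=NEW_DOCS):
    """Return the sum of the per-source document counts."""
    return sum(counts.values())


if __name__ == "__main__":
    total = total_new_documents()
    for source, count in NEW_DOCS.items():
        # Show each source's share of the newly added documents.
        print(f"{source:<22} {count:>7}  {count / total:6.1%}")
    print(f"{'Total':<22} {total:>7}")
```

The total covers only the newly added data listed in the table, not the content carried over from the second edition.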

***

(2) CSLU: Kids' Speech Version 1.1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S18> is a collection of spontaneous and prompted speech from 1100 children between Kindergarten and Grade 10 in the Forest Grove School District in Oregon. All children -- approximately 100 children at each grade level -- read approximately 60 items from a total list of 319 phonetically-balanced but simple words, sentences or digit strings. Each utterance of spontaneous speech begins with a recitation of the alphabet and contains a monologue of about one minute in duration. This release consists of 1017 files containing approximately 8-10 minutes of speech per speaker. Corresponding word-level transcriptions are also included.

This corpus was developed to facilitate research about the characteristics of children's speech at different ages and to train and evaluate recognizers for use in language training and other interactive tasks involving children, including recognizers used in language development with deaf children. Information about each subject's age, gender, languages spoken and physical conditions affecting speech was also collected.

***

(3) GALE Phase 1 Distillation Training <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T20> constitutes the final release of training data created by LDC for the DARPA GALE Program Phase 1 Distillation technology evaluation. Distillation is one of three primary technology components for the DARPA GALE Program, along with Transcription and Translation. Distillation engines respond to queries from English-speaking users, delivering pertinent, consolidated information in easy-to-understand forms. The distillation engine processes English and foreign language material, both speech and text, from multiple sources and documents, removing redundancy and presenting an integrated response to the user.

This release consists of 248 English, Chinese and/or Arabic queries and their responses created by LDC annotators. Queries conform to one of ten template types. Query responses may include document and snippet relevance judgments, nuggets, nugs and supernugs. 158 of the 248 queries have been annotated for all features, while the remainder are labeled for only some features.

The annotation task involves responding to a series of user queries. For each query, annotators first find relevant documents and identify snippets (strings of contiguous text that answer the query) in the Arabic, Chinese or English source document. Annotators then create a nugget for each fact expressed in the snippet. Semantically equivalent nuggets are grouped into cross-language, cross-document "supernugs".

Queries in this release have been annotated for the following tasks:

* searching for relevant documents and providing yes/no judgments

* extracting snippets

* resolution of pronouns, and certain types of temporal and locative expressions contained in the snippets

* creating nuggets, i.e. atomic pieces of information that an annotator considers a valid answer to the query

* building nugs, i.e. clusters of semantically-equivalent nuggets for each language

* building supernugs, i.e. clusters of semantically-equivalent nugs across languages
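The relationship between snippets, nuggets, nugs and supernugs listed above can be pictured as a small data model. The sketch below is purely illustrative and does not reflect the file format of the release: all class and field names are hypothetical, and the `fact_id` label stands in for an annotator's judgment that two nuggets are semantically equivalent.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Snippet:
    doc_id: str      # source document containing the snippet
    language: str    # e.g. "Arabic", "Chinese" or "English"
    text: str        # contiguous text that answers the query


@dataclass
class Nugget:
    fact: str        # one atomic piece of information
    snippet: Snippet


@dataclass
class Nug:
    language: str    # a nug clusters equivalent nuggets within one language
    nuggets: List[Nugget] = field(default_factory=list)


@dataclass
class Supernug:
    nugs: List[Nug] = field(default_factory=list)  # equivalent nugs across languages

    def languages(self) -> List[str]:
        return sorted({nug.language for nug in self.nugs})


def build_supernugs(labeled: Iterable[Tuple[str, Nugget]]) -> Dict[str, Supernug]:
    """Group (fact_id, nugget) pairs into one supernug per fact_id."""
    by_fact: Dict[str, Dict[str, List[Nugget]]] = {}
    for fact_id, nugget in labeled:
        by_language = by_fact.setdefault(fact_id, {})
        by_language.setdefault(nugget.snippet.language, []).append(nugget)
    return {
        fact_id: Supernug([Nug(lang, items) for lang, items in by_language.items()])
        for fact_id, by_language in by_fact.items()
    }
```

In this sketch, two nuggets from different English documents that share a `fact_id` land in the same nug, and an Arabic nugget with the same `fact_id` becomes a second nug inside the same supernug.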

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium           Phone: (215) 573-1275
University of Pennsylvania           Fax: (215) 573-2175
3600 Market St., Suite 810           ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA           http://www.ldc.upenn.edu
