[Corpora-List] New LDC Corpora

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Jul 7 22:54:00 CEST 2005


LDC2005T20
Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>

LDC2005T10
Chinese English News Magazine Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10>

LDC2005S14
Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new corpora.

------------------------------------------------------------------------


Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>
supports the development of data-driven approaches to natural language
processing (NLP), human language technologies, automatic content
extraction (topic extraction and/or grammar extraction), cross-lingual
information retrieval, information detection, and other forms of
linguistic research on Modern Standard Arabic in general. The LDC was
sponsored to develop an Arabic POS and Treebank of 1,000,000 words, and
this corpus is part three of that project. In this release, both
syntactic (treebank) annotation and annotation on part of speech (POS),
gloss, and word segmentation are provided.

The current Arabic Treebank: Part 3 corpus consists of 600 stories from
the An Nahar News Agency. The new features include complete vocalization
of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive.


*


Chinese English News Magazine Parallel Text
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10>
contains Chinese news stories and their English translations drawn from
Sinorama Magazine, Taiwan, from 1976 to 2004. The corpus totals 6,366
story pairs, 365,568 sentence pairs, 20M Chinese characters and 9M
English words. The corpus is aligned at the sentence level. The data as
obtained from Sinorama Magazine was aligned at the story level, and the
sentence alignment was done at the LDC using champollion v1.1. The
Sinorama Chinese text is encoded in Big5.
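
Because the Chinese side is distributed in Big5, users working in a
Unicode pipeline will probably want to convert it first. The following is
a minimal sketch in Python, not part of the LDC release. It assumes the
text decodes with Python's standard "big5" codec; if the files use vendor
extensions, "cp950" or "big5hkscs" may be needed instead. The function
name and the sample string are hypothetical.

```python
def big5_to_utf8(raw: bytes, codec: str = "big5") -> bytes:
    """Decode Big5-encoded bytes and re-encode them as UTF-8.

    The default codec is an assumption; pass "cp950" or "big5hkscs"
    if the plain "big5" codec cannot decode some characters.
    """
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    text = raw.decode(codec, errors="replace")
    return text.encode("utf-8")


# Hypothetical sample: two common characters round-tripped through Big5.
sample = "中文".encode("big5")
converted = big5_to_utf8(sample)
```

For whole files, the same conversion can be done by opening the input
with encoding="big5" and the output with encoding="utf-8".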


*

Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14>
contains 901 calls, totaling 133.6 hours of telephone conversation
speech in Levantine Arabic. The majority of speakers in this corpus are
Lebanese. The corpus also includes 901 transcript files in UTF-8 format.
Speaker information files are provided.




------------------------------------------------------------------------


If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573
1275.


--------------------------------------------------------------------

Linguistic Data Consortium      Phone: (215) 573-1275
3600 Market Street              Fax: (215) 573-2175
Suite 810                       ldc at ldc.upenn.edu
Philadelphia, PA 19104          http://www.ldc.upenn.edu
