[Corpora-List] New Releases from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Mar 7 17:26:02 CET 2005


The Linguistic Data Consortium (LDC) would like to announce the
availability of three new corpora.


------------------------------------------------------------------------


(1) ACE Time Normalization (TERN) 2004 English Training Data
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T07>
contains the English training data prepared for the 2004 Time Expression
Recognition and Normalization (TERN) Evaluation. The purpose of this
corpus and the TERN evaluation is to advance the state of the art in the
automatic recognition and normalization of natural language temporal
expressions. In most language contexts such expressions are indexical.
For example, with "Monday", "last week", or "three months starting
October 1", one must know the narrative reference time in order to
pinpoint the time interval being conveyed by the expression.

In addition, for data exchange purposes, it is essential that the
identified interval be rendered according to an established standard,
i.e., normalized. Accurate identification and normalization of temporal
expressions is in turn essential for the temporal reasoning being
demanded by advanced NLP applications such as question answering,
information extraction, and summarization.
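
As a rough illustration of what resolving and normalizing an indexical
expression involves, here is a minimal Python sketch. It is not the TERN
annotation scheme: the function name `normalize`, the choice of ISO 8601
strings as the target format, and the rule that a bare weekday means the
nearest such day on or after the reference date are all assumptions made
for the example.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression, reference):
    """Resolve a few indexical expressions against a reference date.

    Returns an ISO 8601 string, or None if the expression is not covered.
    `normalize` is a hypothetical helper, not part of any TERN tool.
    """
    text = expression.strip().lower()
    if text in WEEKDAYS:
        # Assumption: a bare weekday is the nearest such day on or after
        # the reference date. Real text often needs tense cues instead.
        offset = (WEEKDAYS.index(text) - reference.weekday()) % 7
        return (reference + timedelta(days=offset)).isoformat()
    if text == "last week":
        # The ISO week immediately before the week of the reference date.
        year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"
    return None


# Reference time: the date of this announcement, Monday 7 March 2005.
reference = date(2005, 3, 7)
print(normalize("Monday", reference))     # 2005-03-07
print(normalize("last week", reference))  # 2005-W09
```

The same expression yields a different interval for every reference
date, which is why the narrative reference time has to be known before
normalization can happen.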

(2) Arabic Treebank: Part 1 v 3.0 (POS with full vocalization and
syntactic analysis)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T02>
is a re-release of the LDC corpus Arabic Treebank: Part 1 v 2.0, with the
addition of improved morphological/part-of-speech annotation including
full vocalization and case endings. The corpus supports the development
of data-driven approaches to natural language processing (NLP), human
language technologies, automatic content extraction, cross-lingual
information retrieval, information detection, and other forms of
linguistic research on Modern Standard Arabic.

The project targets the description of a written Modern Standard Arabic
corpus from the Agence France Presse (AFP) newswire archives for
July-November 2000. This corpus includes 734 stories representing 145K
words.

(3) Multiple Translation Arabic (MTA) Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T05>
supports the development of automatic means for evaluating translation
quality. The corpus contains 4 sets of human translations and 2 sets of
commercial off-the-shelf (COTS) system outputs for a single set of
Arabic source materials. Additionally, there is one output set from a
TIDES 2003 MT Evaluation participant, which is representative of
state-of-the-art research systems.

To see whether automatic evaluation methods, such as BLEU, track human
assessment, the LDC performed human assessment on the two COTS outputs
and the TIDES research system output. The corpus includes the assessment
results for one of the two COTS systems, the assessment results for the
TIDES research system, and the specifications used for conducting the
assessments.
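
To show why several human translations of the same source are useful
for automatic evaluation, here is a minimal Python sketch of clipped
(modified) n-gram precision against multiple references, which is the
core quantity in BLEU. It is a simplified illustration, not a full BLEU
implementation: complete BLEU also combines several n-gram orders and
applies a brevity penalty. The function names and the example sentences
are invented for the sketch and are not taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams matched in the references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Highest count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Invented example: one system output, two reference translations.
candidate = "the the the cat".split()
references = ["the cat sat".split(), "a cat sat on the mat".split()]
print(clipped_precision(candidate, references, 1))  # 0.5
```

With more reference translations, a valid but differently worded system
output is more likely to find matching n-grams, so the score depends
less on the wording choices of any single translator.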

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc at ldc.upenn.edu or call +1 215 573
1275.


--------------------------------------------------------------------

Linguistic Data Consortium       Phone: (215) 573-1275
3600 Market Street               Fax: (215) 573-2175
Suite 810                        ldc at ldc.upenn.edu
Philadelphia, PA 19104           http://www.ldc.upenn.edu

