[Corpora-List] LDC Online and New Corpora

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Mar 30 19:04:00 CEST 2005

** New LDC Online Services! <https://online.ldc.upenn.edu/login.html> **

* ACE 2004 Multilingual Training Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T09>

* Chinese News Translation Text Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T06>

* Discourse Graphbank
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T08>


The LDC would like to announce the availability of a new LDC Online
service and the release of three new corpora.


The LDC is pleased to announce that an improved LDC Online service is
now available. LDC Online can be accessed at the following URL:

https://online.ldc.upenn.edu/login.html

Organizations that hold 2005 Membership in the LDC will be able to
perform text searches on our entire English Gigaword corpus. This
corpus is a comprehensive archive of newswire text data that has been
acquired over several years by the LDC. Current members will also be
able to access the American English Spoken Lexicon (AESL). AESL
contains pronunciations in individual audio files for more than 50,000
of the most common words in English.

Even if your organization is not a current member, you can access LDC
Online through a guest account. As a guest, an LDC Online user will be
able to access the American English Spoken Lexicon.

We will offer periodic updates to LDC Online to include new corpora and
search functions. Please check in with us often as we anticipate this
will be an exciting offering.


ACE 2004 Multilingual Training Corpus
contains the complete set of English, Arabic and Chinese training data
for the 2004 Automatic Content Extraction (ACE) technology evaluation.
The objective of the ACE program is to develop automatic content
extraction technology to support automatic processing of human language
in text form.

Sites were evaluated on system performance in six areas: Entity
Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR
Co-reference, Relation Detection and Recognition (RDR), Relation Mention
Detection (RMD), and RDR given reference entities. All tasks were
evaluated in three languages: English, Chinese and Arabic.


Chinese News Translation Text Part 1
supports the development of automatic machine translation systems. The
LDC was sponsored to solicit English translations for a single set of
Chinese source materials.

The source Chinese text and its English translations were selected and
translated in different LDC projects. A total of about 474K Chinese
characters were selected from two sources, namely Xinhua and AFP, and
translation services were provided by seven translation agencies. Each
Chinese news story was translated once.


Discourse Graphbank
aims to define a descriptively adequate data structure for representing
discourse coherence structures. This project also investigates the
impact of discourse coherence structures on other linguistic processes
and natural language applications (e.g. anaphor resolution,
summarization, information retrieval), and aims to develop and test
discourse parsing algorithms. The data consists of 135 texts from AP
Newswire and Wall Street Journal, annotated with coherence relations.
The source of the data is TIPSTER Complete (LDC93T3A).


If you need further information, or would like to inquire about
membership to the LDC, please email ldc at ldc.upenn.edu or call +1 215 573

Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104            http://www.ldc.upenn.edu
