[Corpora-List] New Publications from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Jun 25 18:14:29 CEST 2009


LDC2009T15 - GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15>

LDC2009T14 - Tagged Chinese Gigaword Version 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14>

The Linguistic Data Consortium (LDC) would like to announce the availability of two new publications.

------------------------------------------------------------------------

*New Publications*

(1) GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15> contains 240,000 characters (112 files) of Chinese newsgroup text and its translation, selected from twenty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program.

Preparing the source data involved four stages of work: data scouting, data harvesting, formatting and data selection.

Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest to a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout. Once the text was downloaded, its format was standardized so that the data could be more easily integrated into downstream annotation processes. Typically, a new script was required for each new domain name that was identified. After scripts were run, an optional manual process corrected any remaining formatting problems.

The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
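The ranking step described above can be illustrated with a minimal sketch. The announcement does not say which statistical approach was used, so this example substitutes a simple one as an assumption: cosine similarity between character-bigram counts of each candidate and the pooled counts of the documents already labeled "good." The function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams (a tokenizer-free feature for Chinese)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(good_docs, candidates):
    """Return (score, doc_id) pairs, most similar to the "good" set first."""
    # Assumed stand-in for the unspecified statistical model: pool the
    # bigram counts of all already-selected documents into one profile.
    profile = Counter()
    for text in good_docs:
        profile.update(char_bigrams(text))
    scored = [(cosine(char_bigrams(text), profile), doc_id)
              for doc_id, text in candidates.items()]
    return sorted(scored, reverse=True)
```

An annotator would then review the list from the top. Each document judged suitable could be appended to good_docs before the next ranking pass, which mirrors the feedback loop described above.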

Manual sentence unit/segment (SU) annotation was also performed as part of the transcription task. Three types of end-of-sentence SU were identified: statement SU, question SU, and incomplete SU. After transcription and SU annotation, files were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE Translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features and quality control procedures applied to completed translations.


(2) Tagged Chinese Gigaword Version 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14>, created by scholars at Academia Sinica <http://www.sinica.edu.tw/main_e.shtml>, Taipei, Taiwan, is a part-of-speech tagged version of LDC's Chinese Gigaword Second Edition (LDC2005T14) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>. Like the original release, Version 2.0 contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency, Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags. In addition, this new release removes residual noise in the original and improves tagging accuracy by incorporating lexica of unknown words. The changes represented in Version 2.0 include the following:

* A single-width space is used consistently between two segmented words.
* The position of the newline character remains fixed, better reflecting the source files from Chinese Gigaword Second Edition (LDC2005T14) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>.
* The original coding of partial Latin letters or Arabic numerals is preserved.
* 1,192 documents from Central News Agency (Taiwan) and 13 documents from Xinhua News Agency that were missing from the first publication are included.
* A set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality of very large corpora is incorporated.
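
The first two changes in the list can be pictured with a small sketch. It assumes that "single-width" means the ordinary ASCII space, as opposed to the full-width ideographic space (U+3000), which the announcement does not state. The function name is hypothetical and this is not the tool used to build the corpus.

```python
import re

# Runs of ASCII spaces, tabs, and full-width (ideographic, U+3000) spaces.
# Assumption: these are the separators that may occur between segmented words.
_SPACE_RUN = re.compile(r"[ \t\u3000]+")


def normalize_spaces(text):
    """Collapse each run of horizontal whitespace to one ASCII space, line by line."""
    # Split on "\n" only, so the number and position of newlines are unchanged.
    lines = text.split("\n")
    return "\n".join(_SPACE_RUN.sub(" ", line).strip(" ") for line in lines)
```

Working line by line keeps every newline where the source file had it, while mixed or repeated separators inside a line become a single space.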

Documents in the corpus were assigned one of the following categories:

* *story*: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* *multi*: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* *advis*: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
* *other*: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
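
As a rough illustration of how these categories might be tallied, the sketch below counts the type attribute of each DOC element. It assumes the files use Gigaword-style opening tags of the form <DOC id="..." type="story">, which the announcement itself does not show. The function name and any sample IDs are hypothetical.

```python
import re
from collections import Counter

# Assumed opening-tag shape: <DOC id="..." type="story">
_DOC_TYPE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def count_doc_types(sgml_text):
    """Tally the type attribute of every <DOC ...> opening tag in the text."""
    return Counter(_DOC_TYPE.findall(sgml_text))
```

A tally like this is a quick way to restrict an experiment to *story* documents, for example, and leave the *multi*, *advis* and *other* items aside.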

Since neither manual checking nor automatic checking against a gold standard is feasible for gigaword-size corpora, the authors proposed a quality assurance approach for automatic annotation of very large corpora based on the heterogeneous CKIP and ICTCLAS tagging systems (Huang et al., 2008). By comparison against word lists generated from the ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword, a set of heuristics for building out-of-vocabulary dictionaries to improve quality was proposed. Randomly selected texts were manually checked to evaluate the effect of these out-of-vocabulary dictionaries. Experimental results indicate that 30,562 of the tested words (about 97.3%) were correct.
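
The announcement does not spell out the heuristics themselves. Purely as an illustration of the general idea of comparing two systems' word lists, the sketch below uses one assumed rule: a word becomes an out-of-vocabulary candidate when both segmenters produce it at least min_count times and the existing lexicon lacks it. The function name, the threshold and the rule are all hypothetical and are not the authors' published method.

```python
from collections import Counter


def oov_candidates(words_a, words_b, lexicon, min_count=2):
    """Words that both segmenters produce often enough but the lexicon lacks."""
    freq_a = Counter(words_a)  # e.g. tokens from one tagging system
    freq_b = Counter(words_b)  # e.g. tokens from the other tagging system
    shared = freq_a.keys() & freq_b.keys()
    # Hypothetical heuristic: agreement between systems plus a frequency floor.
    return sorted(
        w for w in shared
        if w not in lexicon and min(freq_a[w], freq_b[w]) >= min_count
    )
```

Requiring agreement between two independently built systems is one way to keep segmentation errors peculiar to a single system out of the candidate list.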

------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
