[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Apr 29 17:42:27 CEST 2010

New Publications:

LDC2010T08 - Arabic Treebank: Part 3 v 3.2

LDC2010T06 - Chinese Web 5-gram Version 1

------------------------------------------------------------------------

New Publications


(1) Arabic Treebank: Part 3 v 3.2 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T08> consists of 599 distinct newswire stories from the Lebanese publication An Nahar with part-of-speech (POS), morphology, gloss and syntactic treebank annotation in accordance with the Penn Arabic Treebank (PATB) Guidelines <http://projects.ldc.upenn.edu/ArabicTreebank/> developed in 2008 and 2009. This release represents a significant revision of LDC's previous ATB3 publications: Arabic Treebank: Part 3 v 1.0 LDC2004T11 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11> and Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) LDC2005T20 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>.

ATB3 v 3.2 contains a total of 339,710 tokens before clitics are split, and 402,291 tokens after clitics are separated for the treebank annotation. This release includes all files that were previously made available to the DARPA GALE program <http://projects.ldc.upenn.edu/gale/index.html> community (Arabic Treebank Part 3 - Version 3.1, LDC2008E22). A number of inconsistencies in the 3.1 release data have been corrected here. These include changes to certain POS tags, together with the resulting tree changes. As a result, additional clitics have been separated, and some tokens that were previously split incorrectly have now been merged.

One file from ATB3 v 2.0, ANN20020715.0063, has been removed from this corpus because its text is an exact duplicate of another file in this release (ANN20020715.0018). This reduces the number of files from 600 in ATB3 v 2.0 to 599 in ATB3 v 3.2.



(2) Chinese Web 5-gram Version 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T06> contains Chinese word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to 5-grams. This data should be useful for statistical language modeling (e.g., for segmentation, machine translation), as well as for other uses. Included with this publication is a simple segmenter written in Perl using the same algorithm used to generate the data.
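To make "n-grams and their observed frequency counts" concrete, here is a minimal Python sketch that counts every n-gram from unigrams up to 5-grams in a list of tokens that has already been segmented. It is an illustration only: the function name count_ngrams and the toy input are assumptions, and this is not the tool or the file format used to produce the release.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count all n-grams of length 1..max_n in an already segmented token list.

    Hypothetical helper for illustration; not part of the LDC release.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Toy input: a pre-segmented sentence repeated twice.
    toy_tokens = ["我", "喜欢", "语言", "我", "喜欢", "语言"]
    for ngram, freq in count_ngrams(toy_tokens).most_common(5):
        print(" ".join(ngram), freq)
```

A real pipeline at web scale would need sharded or streaming counting rather than a single in-memory Counter; the sketch only shows what is being counted.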

N-gram counts were generated from approximately 883 billion word tokens of text from publicly accessible web pages. While the aim was to identify and collect only Chinese language pages, some text from other languages is incidentally included in the final data. Data collection took place in March 2008. This means that no text that was created on or after April 1, 2008 was used.

The input character encoding of documents was automatically detected, and all text was converted to UTF-8. The data are tokenized by an automatic tool, and all continuous Chinese character sequences are sent to the segmenter for segmentation.

The following types of tokens are considered valid:

* A Chinese word containing only Chinese characters.

* Numbers, e.g., 198, 2,200, 2.3, etc.

* Single Latin tokens, such as Google, & ab, etc.



Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium            Phone: (215) 573-1275
University of Pennsylvania            Fax: (215) 573-2175
3600 Market St., Suite 810            ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA            http://www.ldc.upenn.edu

