[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed May 23 17:39:02 CEST 2012

/New publications:/

LDC2012T05 - Chinese Dependency Treebank 1.0

LDC2012T06 - GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1

LDC2012S06 - Turkish Broadcast News Speech and Transcripts

------------------------------------------------------------------------

*New Publications*

(1) Chinese Dependency Treebank 1.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T05> was developed by the Harbin Institute of Technology's <http://en.hit.edu.cn/> Research Center for Social Computing and Information Retrieval <http://ir.hit.edu.cn/english/> (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire stories published between 1992 and 1996 and annotated with syntactic dependency structures. Ill-formed or short sentences were removed from the random sample prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were performed automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Human annotators then annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in the format of CoNLL-X and in UTF-8.
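For readers unfamiliar with CoNLL-X, the sketch below shows one way such a file could be read. It assumes only the standard CoNLL-X layout (one token per line, ten tab-separated columns, a blank line between sentences). The sample tokens and tag names are invented placeholders rather than data from this corpus, and `read_conllx` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The first eight CoNLL-X columns (PHEAD and PDEPREL are ignored here)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # ID of the governing token; 0 marks the root
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(cols)}")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Invented two-sentence sample; the tags are placeholders, not the corpus tagset.
SAMPLE = "\n".join([
    "\t".join(["1", "我", "我", "r", "r", "_", "2", "SBV", "_", "_"]),
    "\t".join(["2", "来", "来", "v", "v", "_", "0", "HED", "_", "_"]),
    "",
    "\t".join(["1", "好", "好", "a", "a", "_", "0", "HED", "_", "_"]),
])

sentences = list(read_conllx(SAMPLE.splitlines()))
roots = [tok.form for sent in sentences for tok in sent if tok.head == 0]
```

To read from disk, open the file with `encoding="utf-8"` and pass the file object to `read_conllx`.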


(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T06> was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 36 source-translation document pairs, comprising 169,109 words of Arabic source text and its English translation. Data is drawn from thirteen distinct Arabic programs broadcast between 2004 and 2007 from the following sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, and Radio Sawa. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription <http://projects.ldc.upenn.edu/gale/Transcription/Arabic-XTransQRTR.V2.pdf> guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines which are included with this release. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. All data are encoded in UTF-8.


(3) Turkish Broadcast News Speech and Transcripts <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012S06> was developed by Boğaziçi University <http://www.boun.edu.tr/en-US/Content/About_BU/History.aspx>, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval.

The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed.

The data was recorded at 32 kHz and re-sampled at 16 kHz. After screening for recording quality, the files were segmented, transcribed, and verified. The segmentation occurred in two steps, an initial automatic segmentation followed by manual correction and annotation which included information such as background conditions and speaker boundaries.

The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines. An English version of the adapted guidelines is provided with the data. Manual segmentation and transcripts were created by native Turkish speakers at Boğaziçi University using Transcriber <http://trans.sourceforge.net/en/presentation.php>. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
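Because the transcripts use ISO-8859-9 rather than UTF-8, they generally need explicit decoding before being combined with UTF-8 tools or data. The following is a minimal sketch using Python's built-in `iso-8859-9` codec; `latin5_to_utf8` is a hypothetical helper name and the sample bytes are illustrative, not taken from the corpus.

```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-9 (Latin5) bytes as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


# Illustrative bytes: 0xF0 is 'ğ' and 0xE7 is 'ç' in ISO-8859-9.
raw = b"Bo\xf0azi\xe7i"

as_latin5 = raw.decode("iso-8859-9")   # correct reading
as_latin1 = raw.decode("iso-8859-1")   # wrong codec: 0xF0 becomes 'ð'
converted = latin5_to_utf8(raw)
```

Decoding with ISO-8859-1 by mistake does not raise an error, since both codecs map every byte; it silently produces wrong Turkish letters, so the codec is worth stating explicitly when opening the files.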


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: 1 (215) 573-1275
University of Pennsylvania          Fax: 1 (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
