[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Oct 22 21:31:41 CEST 2014

New publications:

- Chinese Discourse Treebank 0.5

- GALE Arabic-English Word Alignment -- Broadcast Training Part 2

- United Nations Proceedings Speech

------------------------------------------------------------------------

*New publications*

(1) Chinese Discourse Treebank 0.5 <https://catalog.ldc.upenn.edu/LDC2014T21> was developed at Brandeis University as part of the Chinese Treebank Project <http://www.cs.brandeis.edu/%7Eclp/ctb/> and consists of approximately 73,000 words of Chinese newswire text annotated for discourse relations. It follows the lexically grounded approach of the Penn Discourse Treebank (PDTB) (LDC2008T05 <https://catalog.ldc.upenn.edu/LDC2008T05>) with adaptations based on the linguistic and statistical characteristics of Chinese text. Discourse relations are lexically anchored by discourse connectives (e.g., because, but, therefore), which are viewed as predicates that take abstract objects such as propositions, events and states as their arguments. Along with PDTB-style schemes for English, Turkish, Hindi and Czech, Chinese Discourse Treebank provides an additional perspective on how the PDTB approach can be extended for cross-lingual annotation of discourse relations.

Data was selected from the newswire material in Chinese Treebank 8.0 (LDC2013T21 <https://catalog.ldc.upenn.edu/LDC2013T21>), specifically from Xinhua News Agency stories. There are approximately 5,500 annotation instances. Following the PDTB format, each annotation instance consists of 27 vertical-bar-delimited fields. The fields specify the attributes of the discourse relation as a whole, as well as the attributes of its two arguments. Not all fields are filled in this release. Filled fields are indicated by a pair of angle brackets; the remaining fields are placeholders for future releases.
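As a rough illustration of the record layout described above, the following Python sketch splits one annotation instance into its 27 vertical-bar-delimited fields and reports which positions are non-empty. The function names, the sample line and its values are hypothetical, and the sketch assumes that placeholder fields are simply left empty; the documentation shipped with the corpus should be consulted for the actual field definitions.

```python
EXPECTED_FIELDS = 27  # field count stated in the release description


def parse_instance(line):
    """Split one annotation instance into its vertical-bar-delimited fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields


def filled_positions(fields):
    """Return 0-based indices of non-empty fields.

    Assumption: placeholder fields are empty strings.
    """
    return [i for i, value in enumerate(fields) if value.strip()]


# Synthetic line for illustration only; not taken from the corpus.
sample = "|".join(["Explicit", "12..14"] + [""] * 25)

fields = parse_instance(sample)
print(len(fields))
print(filled_positions(fields))
```

Checking the field count up front makes malformed lines fail loudly instead of silently shifting attribute positions.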


(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 2 <https://catalog.ldc.upenn.edu/LDC2014T22> was developed by LDC and contains 215,923 tokens of word-aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast conversation data collected by LDC from 2007-2009. The Arabic word alignment tasks consisted of the following components:

Normalizing tokenized tokens as needed

Identifying different types of links

Identifying sentence segments not suitable for annotation

Tagging unmatched words attached to other words or phrases


(3) United Nations Proceedings Speech <https://catalog.ldc.upenn.edu/LDC2014S08> was developed by the United Nations <http://www.un.org/> (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly <http://www.un.org/en/ga/> (GA) and First Committee <http://www.un.org/en/ga/first/> (FC) (Disarmament and International Security), and meetings 6434-6763 of the Security Council <http://www.un.org/en/sc/>.

Recordings were made using a customized system following a daily, internally circulated instruction from the Meetings Management Section <http://www.un.org/depts/DGACM/mms.shtml>. Most of the subjects and information related to a particular meeting or session are published in a UN Journal, which can be found here <http://www.un.org/en/documents/journal.asp>.

Data is presented as either mp3 or flac-compressed wav files. All files are 16-bit, single-channel audio sampled at either 22,050 Hz or 8,000 Hz, and are organized by committee and session number, then by language. The folder labeled "Floor" indicates the microphone used by the particular speaker. Those files may include other languages, for instance, if the speaker's language was not among the six official UN languages.


--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium       Phone: 1 (215) 573-1275
University of Pennsylvania       Fax: 1 (215) 573-2175
3600 Market St., Suite 810       ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA       http://www.ldc.upenn.edu
