[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Apr 22 22:10:14 CEST 2014


/New publications:/

* Domain-Specific Hyponym Relations

* GALE Arabic-English Parallel Aligned Treebank -- Web Training

* Multi-Channel WSJ Audio

------------------------------------------------------------------------

*New publications*

(1) Domain-Specific Hyponym Relations <http://catalog.ldc.upenn.edu/LDC2014T07> was developed by the Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology at Xi'an Jiaotong University <http://www.xjtu.edu.cn/en/>, Xi'an, Shaanxi, China. It provides more than 5,000 English hyponym relations in five domains: data mining, computer networks, data structures, Euclidean geometry and microbiology. All hypernym and hyponym words were taken from Wikipedia article titles.

A hyponym relation is a word sense relation that is an IS-A relation. For example, dog is a hyponym of animal and binary tree is a hyponym of tree structure. Applications for domain-specific hyponym relations include taxonomy and ontology learning, query result organization in faceted search, and knowledge organization and automated reasoning in knowledge-rich applications.
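As a rough illustration of how IS-A pairs like these could feed a taxonomy, the Python sketch below stores relations as (hyponym, hypernym) pairs and follows them upward. It is not based on the corpus files: the function names are invented for the example, the "tree structure IS-A data structure" link is hypothetical, and treating IS-A as transitive is an assumption of the sketch.

```python
from collections import defaultdict

# Illustrative (hyponym, hypernym) pairs. The first two are the examples
# from the text; the third is a hypothetical link, not taken from the corpus.
RELATIONS = [
    ("dog", "animal"),
    ("binary tree", "tree structure"),
    ("tree structure", "data structure"),  # hypothetical
]

def build_hypernym_index(relations):
    """Map each term to the set of its direct hypernyms."""
    index = defaultdict(set)
    for hyponym, hypernym in relations:
        index[hyponym].add(hypernym)
    return index

def is_a(term, candidate, index):
    """Return True if `candidate` is reachable from `term` via IS-A links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in index.get(current, ()):
            if parent == candidate:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False

index = build_hypernym_index(RELATIONS)
print(is_a("binary tree", "data structure", index))  # indirect link
print(is_a("dog", "data structure", index))          # no path
```

The `seen` set keeps the walk from looping if a relation list happens to contain a cycle.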

The data is presented in XML format, and each file provides hyponym relations in one domain. Within each file, the term, Wikipedia URL, hyponym relation and the names of the hyponym and hypernym words are included. The distribution of terms and relations is set forth in the table below:

Dataset              Terms   Hyponym Relations
Data Mining            278                 364
Computer Network       336                 399
Data Structure         315                 578
Euclidean Geometry     455                 690
Microbiology         1,028               3,533
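The per-domain figures above can be totalled to check them against the "more than 5,000" relations mentioned earlier. The short Python sketch below does only that arithmetic; the dictionary is transcribed by hand from the table and the variable names are illustrative.

```python
# (terms, hyponym relations) per domain, copied from the table above.
COUNTS = {
    "Data Mining": (278, 364),
    "Computer Network": (336, 399),
    "Data Structure": (315, 578),
    "Euclidean Geometry": (455, 690),
    "Microbiology": (1028, 3533),
}

total_terms = sum(terms for terms, _ in COUNTS.values())
total_relations = sum(relations for _, relations in COUNTS.values())

print(total_terms, total_relations)  # 2412 5564

# Relations per term, to show how the domains differ in density.
for domain, (terms, relations) in COUNTS.items():
    print(f"{domain}: {relations / terms:.2f} relations per term")
```

On these figures, Microbiology accounts for well over half of all relations in the release.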

This data is made available at no cost under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 <http://creativecommons.org/licenses/by-nc-sa/3.0/> license.


(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training <http://catalog.ldc.upenn.edu/LDC2014T08> was developed by LDC and contains 69,766 tokens of word-aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction, and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance through enhanced syntactic parsers, better rules and knowledge about language pairs, and a reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned.

LDC previously released Arabic-English Parallel Aligned Treebanks as follows:

* Newswire <http://catalog.ldc.upenn.edu/LDC2013T10>

* Broadcast News Part 1 <http://catalog.ldc.upenn.edu/LDC2013T14>

* Broadcast News Part 2 <http://catalog.ldc.upenn.edu/LDC2014T03>

This release consists of Arabic source web data (newsgroups, weblogs) collected by LDC in 2004 and 2005. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files   Words    Tokens   Segments
Arabic       162   46,710   69,766      3,178

Note: The word count is based on the untokenized Arabic source; the token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:

* Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)

* Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly segmented segments, segments with foreign languages

* Tagging unmatched words attached to other words or phrases


(3) Multi-Channel WSJ Audio <http://catalog.ldc.upenn.edu/LDC2014S03> was developed by the Centre for Speech Technology Research <http://www.cstr.ed.ac.uk/> at the University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and a single moving speaker.

This corpus was designed to address the challenges of speech recognition in meetings, which often occur in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. Using headset microphones represents one approach, but meeting participants may be reluctant to wear them. Microphone arrays are another option. MCWSJ supports research in large vocabulary tasks using microphone arrays. The news sentences read by speakers are taken from WSJCAM0 Cambridge Read News <http://catalog.ldc.upenn.edu/LDC95S24>, a corpus originally developed for large vocabulary continuous speech recognition experiments, which in turn was based on CSR-I (WSJ0) Complete <http://catalog.ldc.upenn.edu/LDC93S6A>, made available by LDC to support large vocabulary continuous speech recognition initiatives.

Speakers reading news text from prompts were recorded using a headset microphone, a lapel microphone and an eight-channel microphone array. In the single speaker scenario, participants read from six fixed positions. Fixed positions were assigned for the entire recording in the overlapping scenario. For the moving scenario, participants moved from one position to the next while reading.

Fifteen speakers were recorded for the single scenario, nine pairs for the overlapping scenario and nine individuals for the moving scenario. Each read approximately 90 sentences.

------------------------------------------------------------------------

--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: 1 (215) 573-1275
University of Pennsylvania          Fax: 1 (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
