[Corpora-List] News from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Jun 25 22:07:54 CEST 2015


/In this newsletter:/

*Customize a Data Pack from 2013 publications*

/New publications:/

*CIEMPIESS*

*GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences*

*RST Signalling Corpus*

------------------------------------------------------------------------

*Customize a Data Pack from 2013 publications*

There is still time for not-for-profit and government organizations to create a custom data collection of eight corpora from among LDC's 2013 releases. Selection options include: 1993-2007 United Nations Parallel Text, Chinese Treebank 8.0, CSC Deceptive Speech, GALE Arabic and Chinese speech and text releases, Greybeard, MADCAT training data, NIST 2012 Open Machine Translation (OpenMT) evaluation and progress sets, and more. The 2013 Data Pack <https://catalog.ldc.upenn.edu/LDC2015MDP> is available for a flat rate of $3500 through September 15, 2015.

To license the Data Pack and select eight corpora, log in to or register for an LDC user account <https://catalog.ldc.upenn.edu/login> and add the 2013 Data Pack <https://catalog.ldc.upenn.edu/LDC2015MDP> and each of the eight data sets to your bin. Follow the check-out procedure, sign all applicable user agreements and select payment via wire transfer, purchase order or check. LDC will adjust the invoice total to reflect the data pack fee.

To pay via credit card, add the 2013 Data Pack <https://catalog.ldc.upenn.edu/LDC2015MDP> to your bin and check out using the system prompts. At the completion of the transaction, send an email to ldc at ldc.upenn.edu <mailto:ldc at ldc.upenn.edu> indicating the eight data sets to include in your order.

*New publications:*

(1) CIEMPIESS <https://catalog.ldc.upenn.edu/LDC2015S07> (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory <http://odin.fi-b.unam.mx/profesores/abelherrera/> of the Faculty of Engineering at the National Autonomous University of Mexico <http://www.unam.mx/> (UNAM) and consists of approximately 18 hours of Mexican Spanish radio speech, associated transcripts, pronouncing dictionaries and language models. The goal of this work was to create acoustic models for automatic speech recognition.

For more information and documentation see the CIEMPIESS-UNAM Project website <http://www.ciempiess.org/>.

The speech recordings are from 43 one-hour FM radio programs broadcast by Radio IUS <http://www.derecho.unam.mx/cultura-juridica/radio.php>, a UNAM radio station. They consist of spontaneous conversations between a radio moderator and guests, principally about legal issues. Approximately 78% of the speakers were male and 22% were female.

The recordings were transcribed using PRAAT <http://www.fon.hum.uva.nl/praat/>, a tool designed for phonetics research. The transcripts are in Mexbet, a phonetic alphabet designed for Mexican Spanish based on Worldbet (Hieronymus, 1994). Plain text transcripts, textgrid format time labels and files useful for performing experiments with the SPHINX3 <http://www.cs.cmu.edu/%7Earchan/sphinxInfo.html> recognition software are also included.

Non-members may license this data at no cost under the LDC User Agreement for Non-Members <https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf>.

*

(2) GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences <https://catalog.ldc.upenn.edu/LDC2015T14> was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from broadcast conversation data collected by LDC in 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences includes 109 source-translation document pairs, comprising 63,829 tokens of Chinese source text and its English translation. Data is drawn from 17 distinct Chinese programs broadcast in 2008 from Beijing TV, China Central TV, Hubei TV and Voice of America. Broadcast conversation programming is more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

*

(3) RST Signalling Corpus <https://catalog.ldc.upenn.edu/LDC2015T10> was developed at Simon Fraser University and contains annotations for signalling information added to RST Discourse Treebank (LDC2002T07 <https://catalog.ldc.upenn.edu/LDC2002T07>). RST Discourse Treebank (RST-DT) is a collection of English news texts annotated for rhetorical relations under the RST (Rhetorical Structure Theory) framework. In RST Signalling Corpus, information about textual signals -- such as although, because, thus -- and about other signals such as tense, lexical chains or punctuation was added as an annotation layer to examine how rhetorical relations are signalled in discourse.

The source data consists of 385 Wall Street Journal news articles from the Penn Treebank <https://catalog.ldc.upenn.edu/LDC99T42> annotated for rhetorical relations in RST Discourse Treebank. As in RST-DT, the data in this release is divided into a training set (347 articles) and a test set (38 articles).

The signalling annotation in this data set was performed using the UAM CorpusTool <http://www.wagsoft.com/CorpusTool/> version 2.8.12. Files are presented as UTF-8 encoded XML and plain text. The corpus is divided into three annotation sub-directories: training, test and full. All sub-directories include source, metadata, signalling annotation, and DTD files.

------------------------------------------------------------------------

--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104 USA
Phone: 1 (215) 573-1275
Fax: 1 (215) 573-2175
ldc at ldc.upenn.edu
http://www.ldc.upenn.edu
