[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Feb 23 23:28:53 CET 2009


- *Audiovisual Database of Spoken American English* (LDC2009V01) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01> -

- *GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1* (LDC2009T03) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03> -

- *LDC's Corpus Catalog Receives Top OLAC Rating* <http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#1> -

- *2009 Publications Pipeline* <http://www.ldc.upenn.edu/Membership/Agreements/member_announcement.shtml#2> -

------------------------------------------------------------------------

*New Publications*

(1) The Audiovisual Database of Spoken American English <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009V01> was developed at Butler University, Indianapolis, IN, in 2007 for use by a variety of researchers to evaluate speech production and speech recognition. It contains approximately seven hours of audiovisual recordings of fourteen American English speakers producing syllables, word lists, and sentences used in both academic and clinical settings.

All talkers were from the North Midland dialect region -- roughly defined as Indianapolis and north within the state of Indiana -- and had lived in that region for the majority of the time from birth to 18 years of age. Each participant read 238 different words and 166 different sentences. These materials were drawn from the following sources:

* Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)

* Northwestern University Auditory Test No. 6 (Lists I-IV)

* Vowels in /hVd/ context (separate words)

* Texas Instruments/Massachusetts Institute of Technology (TIMIT) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1> sentences

The Audiovisual Database of Spoken American English will be of interest in various disciplines: to linguists for studies of phonetics, phonology, and prosody of American English; to speech scientists for investigations of motor speech production and auditory-visual speech perception; to engineers and computer scientists for investigations of machine audio-visual speech recognition (AVSR); and to speech and hearing scientists for clinical purposes, such as the examination and improvement of speech perception by listeners with hearing loss.

Participants were recorded individually during a single session with a Panasonic DVC-80 digital video camera to miniDV digital video cassette tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid lapel microphone throughout the recordings. Each speaker produced a total of 94 segmented files, which were exported from Final Cut Express as QuickTime (.mov) files.

***

(2) GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T03> was prepared by LDC and contains a total of 178,000 words (264 files) of Arabic newsgroup text and its English translation selected from thirty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups, and similar forums. This release was used as training data in Phase 1 (year 1) of the DARPA-funded GALE program. Preparing the source data involved four stages of work: data scouting, data harvesting, formatting, and data selection.

Data scouting involved manually searching the web for suitable newsgroup text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest in a database. A nightly process queried the annotation database and harvested all designated URLs. Whenever possible, the entire site was downloaded, not just the individual thread or post located by the data scout. Once the text was downloaded, its format was standardized so that the data could be more easily integrated into downstream annotation processes. Typically, a new script was required for each new domain name that was identified. After the scripts were run, an optional manual process corrected any remaining formatting problems.
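As a rough illustration of that nightly step, here is a minimal Python sketch; the database schema, table, and column names are invented for the example, since the announcement does not describe LDC's actual tooling:

    import sqlite3
    import urllib.request
    from pathlib import Path

    def harvest(db_path="scouting.db", out_dir="harvested"):
        """Download every URL the data scouts flagged that has not yet
        been harvested (hypothetical schema: scouted_pages(id, url,
        harvested))."""
        Path(out_dir).mkdir(exist_ok=True)
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT id, url FROM scouted_pages WHERE harvested = 0"
        ).fetchall()
        for page_id, url in rows:
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    Path(out_dir, f"{page_id}.html").write_bytes(resp.read())
                conn.execute(
                    "UPDATE scouted_pages SET harvested = 1 WHERE id = ?",
                    (page_id,),
                )
            except OSError as err:
                print(f"skipping {url}: {err}")
        conn.commit()

    if __name__ == "__main__":
        harvest()

In practice a crawler would fetch whole sites rather than single pages, as noted above, but the query-then-download flow is the same.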

The selected documents were then reviewed for content-suitability using a semi-automatic process. A statistical approach was used to rank a document's relevance to a set of already-selected documents labeled as "good." An annotator then reviewed the list of relevance-ranked documents and selected those which were suitable for a particular annotation task or for annotation in general. These newly-judged documents in turn provided additional input for the generation of new ranked lists.
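The announcement does not name the statistical model, but one simple version of this kind of relevance ranking is TF-IDF-weighted cosine similarity to the centroid of the "good" documents; the sketch below illustrates that assumption and is not LDC's actual method:

    import math
    from collections import Counter

    def tf_idf(tokens, df, n_docs):
        """TF-IDF vector for one tokenized document."""
        tf = Counter(tokens)
        return {t: c * math.log(n_docs / (1 + df[t])) for t, c in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def rank(candidates, good_docs):
        """Order candidate documents by similarity to the good set."""
        all_docs = candidates + good_docs
        df = Counter(t for doc in all_docs for t in set(doc))
        centroid = Counter()
        for doc in good_docs:
            centroid.update(tf_idf(doc, df, len(all_docs)))
        return sorted(
            candidates,
            key=lambda d: cosine(tf_idf(d, df, len(all_docs)), centroid),
            reverse=True,
        )

    good = [["football", "match", "goal"], ["team", "coach", "goal"]]
    cands = [["stock", "market", "index"], ["football", "coach", "team"]]
    print(rank(cands, good)[0])  # the football document ranks first

Each round of annotator judgments would then be folded back into the good set and a new ranked list generated, as the paragraph above describes.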

Manual sentence unit/segment (SU) annotation was also performed as part of the transcription task. Three types of end-of-sentence SU were identified: statement SU, question SU, and incomplete SU. After transcription and SU annotation, files were reformatted into a human-readable translation format and assigned to professional translators for careful translation. Translators followed LDC's GALE Translation guidelines, which describe the makeup of the translation team, the source data format, the translation data format, best practices for translating certain linguistic features, and the quality control procedures applied to completed translations.

All final data are presented in Tab Delimited Format (TDF). TDF is compatible with other transcription formats, such as the Transcriber format and the AG format, making it easy to process.
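For readers new to TDF, each line is one segment with tab-separated fields. A minimal Python reader follows; the column list here is an assumption based on common LDC TDF transcripts, so the corpus documentation should be treated as authoritative:

    import csv

    # Assumed field order; real TDF files declare their columns in a
    # header line, which should be trusted over this list.
    COLUMNS = ["file", "channel", "start", "end", "speaker",
               "speakerType", "speakerDialect", "transcript",
               "section", "turn", "segment"]

    def read_tdf(path):
        """Yield one dict per segment in a TDF file."""
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                if row[0].startswith((";;", "file;")):  # comment or header
                    continue
                yield dict(zip(COLUMNS, row))

    # Usage (hypothetical file name):
    # for seg in read_tdf("example.tdf"):
    #     print(seg["speaker"], seg["transcript"])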

*LDC's Corpus Catalog Receives Top OLAC Rating*

LDC is pleased to announce that The LDC Corpus Catalog <http://www.ldc.upenn.edu/Catalog/> has been awarded a five-star quality rating, the highest rating available, by the Open Language Archives Community (OLAC) <http://www.language-archives.org/>. OLAC is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. LDC supports OLAC and is among the 37 participating archives that have contributed over 36,000 records to the combined catalog of language resources. OLAC seeks to refine the quality of the metadata in catalog records in order to improve the quality of searching that users can do over that catalog. When resources are described following the best practice guidelines established by OLAC, it increases the likelihood that all the resources returned by a query are relevant (precision) and that all relevant resources are returned (recall).
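In standard information-retrieval notation (the definitions are not in the announcement, but they make the claim concrete), for a query returning a set of resources:

    \[
    \text{precision} = \frac{|\text{relevant} \cap \text{returned}|}{|\text{returned}|},
    \qquad
    \text{recall} = \frac{|\text{relevant} \cap \text{returned}|}{|\text{relevant}|}
    \]

Better OLAC-compliant metadata pushes both quantities toward 1 for searches over the combined catalog.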

Metadata in several fields of the LDC catalog was missing, inaccurate, or non-compliant with OLAC standards. Over a period of a few months, a team at LDC took several steps to make that metadata OLAC-compliant. Most significantly, the language name and the language ID for over 400 corpora were reviewed and, where required, changed to conform to the new standard for language identification, ISO 639-3 <http://www.sil.org/iso639-3/>. Additional efforts focused on providing author information for all corpora and fixing dead links. Finally, the team added a new metadata field that consistently documents the "type" of each resource, using DCMI-Type, a standard vocabulary from the digital libraries community that reliably distinguishes text and sound resources. These revisions improve LDC's management of the resources in the catalog and help LDC users quickly identify all corpora relevant to their research.

*2009 Publications Pipeline*

For Membership Year 2009 (MY2009), we anticipate releasing a varied selection of publications. Many publications are still in development, but here is a glimpse of what is in the pipeline for MY2009. Please note that this list is tentative and subject to modification. Our planned publications include:

/Arabic Gigaword Fourth Edition/ ~ edition includes our recent newswire collections as well as the contents of Arabic Gigaword Third Edition (LDC2007T40). In addition to sources found in previous releases, such as Xinhua, Agence France Presse, An Nahar, and Al Hayat, this release includes data from several new sources, such as Al Quds, Asharq Al-Awsat, and Al Ahram.

/Chinese Gigaword Fourth Edition/ ~ edition includes our recent newswire collections as well as the contents of Chinese Gigaword Third Edition (LDC2007T38). In addition to sources found in previous releases, such as Agence France Presse, Central News Agency (Taiwan), Xinhua, and Zaobao, this release includes data from several new sources, such as People's Liberation Army Daily, Guangming Daily, and China News Service.

/Chinese Web 5-gram Corpus Version 1/ ~ contains n-grams (unigrams to five-grams) and their observed counts in 880 billion tokens of Chinese web data collected in March 2008. All text was converted to UTF-8. A simple segmenter using the same algorithm used to generate the data is included. The set contains 3.9 billion n-grams in total (see the n-gram counting sketch after this list).

/CoNLL 2008 Shared Task Corpus/ ~ includes syntactic and semantic dependencies for Treebank-3 (LDC99T42) data. This corpus was developed for the 2008 shared task of the Conference on Natural Language Learning (CoNLL 2008). The syntactic information was created by converting constituent trees from Treebank-3 to dependencies using a set of head percolation rules and a series of other transformations; for example, named entity boundaries are included from the BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33). The semantic dependencies were created by converting semantic propositions to a dependency representation. The corpus includes propositions centered around both verbal predicates, from Proposition Bank I (LDC2004T14), and nominal predicates, from NomBank 1.0 (LDC2008T24). A sketch of head-percolation conversion appears after this list.

/English Gigaword Fourth Edition/ ~ edition includes our recent collections as well as the contents of English Gigaword Third Edition (LDC2007T07). The sources of text data include Agence France Presse, Associated Press, Central News Agency (Taiwan), New York Times, Xinhua, and Salon.com.

/GALE Phase 1 Arabic Newsgroup Parallel Text Part 2/ ~ 145K words (263 files) of Arabic newsgroup text and its English translation selected from thirty sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups, and similar forums. This release was used as training data in Phase 1 of the DARPA-funded GALE program.

/GALE Phase 1 Chinese Broadcast Conversation Parallel Text Part 2/ ~ a total of 24 hours of Chinese broadcast conversation selected from three sources: China Central TV (CCTV), Phoenix TV, and Voice of America. This release was used as training data in Phase 1 of the DARPA-funded GALE program.

/GALE Phase 1 Chinese Newsgroup Parallel Text Part 1/ ~ 240K characters (112 files) of Chinese newsgroup text and its English translation selected from twenty-five sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups, and similar forums. This release was used as training data in Phase 1 of the DARPA-funded GALE program.

/Japanese Web N-gram Corpus Version 1/ ~ contains n-grams (unigrams to seven-grams) and their observed counts in 250 billion tokens of Japanese web data collected in July 2007. All text was converted to UTF-8 and segmented using the publicly available segmenter MeCab. The set contains 3.2 billion n-grams in total (see the n-gram counting sketch after this list).

/NIST MetricsMATR08 Development Data/ ~ contains sample data extracted from the NIST Open Machine Translation (MT) 2006 evaluation. The data includes English machine translations from 8 systems and the human reference translations for 25 Arabic source language newswire documents, along with corresponding human assessments of adequacy and preference. This data set was originally provided to NIST MetricsMATR08 participants for the purpose of MT metric development.
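For readers unfamiliar with n-gram counts (relevant to the Chinese Web 5-gram and Japanese Web N-gram entries above), the Python sketch below shows what is being counted. It assumes tokenization or segmentation (e.g., MeCab for Japanese) has already been applied, and it is an illustration, not the tooling shipped with either corpus:

    from collections import Counter

    def ngram_counts(tokens, max_n=5):
        """Count every n-gram from unigrams up to max_n-grams."""
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    tokens = "the cat sat on the mat".split()
    counts = ngram_counts(tokens)
    print(counts[("the",)])        # 2
    print(counts[("the", "cat")])  # 1

The published corpora are, in effect, such count tables accumulated over hundreds of billions of web tokens, typically with very low-frequency n-grams pruned.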

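The CoNLL 2008 entry above mentions converting constituent trees to dependencies via head percolation rules. The toy Python sketch below shows the core idea only; the head table here is tiny and invented, and the actual conversion involved far larger rule tables plus special handling for coordination, punctuation, and named entities:

    # Toy head-percolation table: for each phrase label, the child labels
    # preferred as head, in priority order (real tables are much larger).
    HEAD_RULES = {
        "S":  ["VP", "S"],
        "VP": ["VBD", "VBZ", "VB", "VP"],
        "NP": ["NN", "NNS", "NNP", "NP"],
    }

    class Node:
        def __init__(self, label, children=None, word=None):
            self.label = label
            self.children = children or []
            self.word = word  # set only on terminal (word) nodes

    def head_child(node):
        """Pick the head child via the table, defaulting to the first."""
        for wanted in HEAD_RULES.get(node.label, []):
            for child in node.children:
                if child.label == wanted:
                    return child
        return node.children[0]

    def lexical_head(node):
        """Percolate the head down to a terminal node."""
        return node if node.word else lexical_head(head_child(node))

    def to_dependencies(node, deps):
        """Attach each non-head child's lexical head to the node's head."""
        if node.word:
            return
        head = lexical_head(node)
        for child in node.children:
            dep = lexical_head(child)
            if dep is not head:
                deps.append((dep.word, head.word))  # (dependent, head)
            to_dependencies(child, deps)

    # "John saw Mary"
    tree = Node("S", [
        Node("NP", [Node("NNP", word="John")]),
        Node("VP", [Node("VBD", word="saw"),
                    Node("NP", [Node("NNP", word="Mary")])]),
    ])
    deps = []
    to_dependencies(tree, deps)
    print(deps)  # [('John', 'saw'), ('Mary', 'saw')]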

2009 Subscription Members are automatically sent all MY2009 data as it is released. 2009 Standard Members are entitled to request 16 corpora for free from MY2009. Non-members may license most data for research use.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium           Phone: (215) 573-1275
University of Pennsylvania           Fax: (215) 573-2175
3600 Market St., Suite 810           ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA           http://www.ldc.upenn.edu



