[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Nov 25 23:51:49 CET 2008


*LDC Spoken Language Sampler Available for Free Download* <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>

LDC2008S09 - CHAracterizing INdividual Speakers (CHAINS) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09>

LDC2008T20 - PennBioIE CYP 1.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20>

LDC2008T21 - PennBioIE Oncology 1.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21>

The Linguistic Data Consortium (LDC) would like to announce the availability of a free spoken language sampler as well as the release of three new publications.

------------------------------------------------------------------------

*LDC Spoken Language Sampler Available for Free Download*

The LDC Spoken Language Sampler <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08> provides speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from LDC's Catalog. Created for distribution at NWAV 37 and geared towards sociolinguists, the sampler is a good introduction to data available from the LDC. The sampler includes excerpts from telephone conversations in Arabic (Gulf, Iraqi, and Levantine dialects), Farsi, Japanese, Korean, Spanish, and Tamil; dictionary resources for Mawukakan and Tamil; transcribed meeting speech; utterances in Russian from native and non-native speakers; and speech samples which represent regional accents and dialects of the United States. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.

The sampler can be downloaded for free from the catalog page for the LDC Spoken Language Sampler <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S08>. Please scroll down to 'How to Obtain' for a download link.

*New Publications*

(1) CHAracterizing INdividual Speakers (CHAINS) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008S09> contains recordings of thirty-six English speakers reading fables and selected sentences in different speaking styles. The data was obtained in two different sessions with a time separation of about two months. The goal of the corpus is to provide a range of speaking styles and voice modifications for speakers sharing the same accent. Other existing corpora, in particular CSLU Speaker Recognition Version 1.1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>, TIMIT <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1> and the IViE corpus <http://www.phon.ox.ac.uk/IViE/> (English Intonation in the British Isles), served as referents in the selection of material. This design decision was made to ensure that methods designed and evaluated on the CHAINS corpus might be directly testable on these other corpora, which were recorded using quite different dialects and channel characteristics.

The data was collected in two recording sessions in a total of six different speaking styles:

* solo reading

* synchronous reading

* spontaneous speech ("retell")

* repetitive synchronous imitation ("rsi")

* whispered fast reading

* fast speech reading

In two of the speaking conditions adopted, speakers modified their speech in a constrained fashion towards a known target: in the synchronous condition, the speech of the co-speaker served as a target, while in rsi, there was an explicit known static target. The presence of a known target which speakers aim to copy raises the bar in the discovery and design of procedures for automatic speaker identification, as the target speech provides a potentially highly confusing foil. The whisper and fast speech conditions are also well-defined speaking styles which require substantial voice modification by the speaker.

***

(2) The PennBioIE CYP <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T20> corpus consists of 1100 PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi> abstracts on the inhibition of cytochrome P450 enzymes. The abstracts comprise approximately 313,000 total words of text. Each file has been tokenized and its biomedical portions (274,000 total words) exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for 5 types of biomedical named entity in three categories of interest. 324 of the abstracts have also been syntactically annotated.

Annotation at all layers except entity is based on the Penn Treebank II guidelines <ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/>, with a number of modifications that have been found necessary, many of which were subsequently adopted by the Penn Treebank. Entity definitions came originally from domain experts and were developed and refined in dialogue with the annotators. All annotation is standoff: the source text is never modified, annotations being made in a separate file. Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual.
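As a rough illustration of what "standoff" means in general, the sketch below keeps a source string untouched and stores annotations separately as character offsets plus a label. This is not the PennBioIE file format: the `Span` class, the `resolve` helper, the example sentence, and the labels are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation (hypothetical layout, not the PennBioIE format)."""
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str  # illustrative label, not one of the corpus's entity types


def resolve(text: str, span: Span) -> str:
    """Return the stretch of source text that a standoff span points at."""
    return text[span.start:span.end]


# The source text is stored once and never edited.
source = "Ketoconazole inhibits CYP3A4."

# Annotations live apart from the text and refer to it only by offsets.
annotations = [
    Span(0, 12, "substance"),
    Span(22, 28, "enzyme"),
]

for span in annotations:
    print(span.label, resolve(source, span))
```

Because the annotations only point into the text, several layers (sentence, part of speech, entity) can be added or corrected independently without touching the source file.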

***

(3) The PennBioIE Oncology <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T21> corpus consists of 1414 PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi> abstracts on cancer, concentrating on molecular genetics. The abstracts comprise approximately 381,000 total words of text. Each file has been tokenized and its biomedical portions (327,000 total words) exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for 16 ("Level 1") or 23 ("Level 2") types of named entity. 318 of the abstracts have also been syntactically annotated.

Annotation at all layers except entity is based on the Penn Treebank II guidelines <ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/>, with a number of modifications that have been found necessary, many of which were subsequently adopted by the Penn Treebank. Entity definitions came originally from domain experts and were developed and refined in dialogue with the annotators. All annotation is standoff: the source text is never modified, annotations being made in a separate file. Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual.

The oncology data comprises two subcorpora:

* The Sanger subcorpus (san) consists of abstracts of 577 articles previously annotated by the Sanger Institute for global mention of oncological named entities. These annotations were metadata reflecting the presence or absence of such mentions anywhere in the text. The articles concentrate on variations in a small set of human genes associated with many different types of cancer. We did not refer to the Sanger annotations after selection of the abstracts.

* The neuroblastoma subcorpus (nb) consists of 837 abstracts of articles dealing with this particular type of cancer selected by colleagues at Children's Hospital of Philadelphia. They do not all concentrate on genetics, but they mention a much larger number of genes than the Sanger files do.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104 USA
Phone: (215) 573-1275
Fax: (215) 573-2175
ldc at ldc.upenn.edu
http://www.ldc.upenn.edu
