[Corpora-List] June 2019 Newsletter - LDC

Penn LDC ldc at ldc.upenn.edu
Mon Jun 17 18:17:55 CEST 2019


In this newsletter:

New Publications:

* DEFT Spanish Committed Belief Annotation<https://catalog.ldc.upenn.edu/LDC2019T09>

* USC-SFI MALACH Interviews and Transcripts English - Speech Recognition Edition<https://catalog.ldc.upenn.edu/LDC2019S11>

* First DIHARD Challenge Development - Eight Sources<https://catalog.ldc.upenn.edu/LDC2019S09>

* First DIHARD Challenge Development - SEEDLingS<https://catalog.ldc.upenn.edu/LDC2019S10>

New publications:

(1) DEFT Spanish Committed Belief Annotation<https://catalog.ldc.upenn.edu/LDC2019T09> was developed by LDC and consists of approximately 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.

DEFT Spanish Committed Belief Annotation is distributed via web download. 2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) USC-SFI MALACH Interviews and Transcripts English - Speech Recognition Edition<https://catalog.ldc.upenn.edu/LDC2019S11> was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project<http://malach.umiacs.umd.edu/> and contains approximately 168 hours of interviews from 682 Holocaust witnesses along with transcripts, a lexicon and other documentation.

This release augments USC-SFI MALACH Interviews and Transcripts English (LDC2012S05<https://catalog.ldc.upenn.edu/LDC2012S05>) by modifying and updating a subset of the original corpus for use with speech recognition systems, such as the Kaldi<https://kaldi-asr.org/> toolkit. Specifically, the audio data has been converted from unsegmented mpeg files to a segmented flac compressed format. The speaker-turn, time-stamped transcripts have been updated to an utterance-by-utterance format. A lexicon mapping words to phonemes is provided, and the data is divided into development and training sets.

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives in order to advance the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching, and emotional speech -- were considered well-suited for that task.

USC-SFI MALACH Interviews and Transcripts English - Speech Recognition Edition is distributed via web download. 2019 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) First DIHARD Challenge Development - Eight Sources<https://catalog.ldc.upenn.edu/LDC2019S09> was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge<https://coml.lscp.ens.fr/dihard/2018/index.html>. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10<https://catalog.ldc.upenn.edu/LDC2019S10>), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):

* Autism Diagnostic Observation Schedule (ADOS) interviews

* DCIEM/HCRC map task (LDC96S38<https://catalog.ldc.upenn.edu/LDC96S38>)

* Audiobook recordings from LibriVox<https://librivox.org/>

* Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (LDC2007S11<https://catalog.ldc.upenn.edu/LDC2007S11>) and Evaluation (LDC2007S12<https://catalog.ldc.upenn.edu/LDC2007S12>) releases.

* 2001 U.S. Supreme Court oral arguments

* Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15<https://catalog.ldc.upenn.edu/LDC2003T15>)

* Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project

* YouthPoint radio interviews

First DIHARD Challenge Development - Eight Sources is distributed via web download. 2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) First DIHARD Challenge Development - SEEDLingS<https://catalog.ldc.upenn.edu/LDC2019S10> was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge<https://coml.lscp.ens.fr/dihard/2018/index.html>. This release, when combined with First DIHARD Challenge Development - Eight Sources (LDC2019S09<https://catalog.ldc.upenn.edu/LDC2019S09>), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.

The source data was drawn from the SEEDLingS<https://homebank.talkbank.org/access/Password/Bergelson.html> (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York, area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions.

First DIHARD Challenge Development - SEEDLingS is distributed via web download. 2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
