[Corpora-List] May 2018 Newsletter - LDC

Penn LDC ldc at ldc.upenn.edu
Tue May 15 16:56:21 CEST 2018


In this newsletter:

New Publications:

Rhythm and Pitch<https://catalog.ldc.upenn.edu/LDC2018S04>

GALE Phase 4 Arabic Broadcast News Speech<https://catalog.ldc.upenn.edu/LDC2018S05>

GALE Phase 4 Arabic Broadcast News Transcripts<https://catalog.ldc.upenn.edu/LDC2018T14>

_______________________________________________________________________________

New publications:

(1) Rhythm and Pitch<https://catalog.ldc.upenn.edu/LDC2018S04> contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42<https://catalog.ldc.upenn.edu/LDC97S42>) and Boston University Radio Speech Corpus (LDC96S36<https://catalog.ldc.upenn.edu/LDC96S36>).

The RaP system permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers are used for annotating speech prosody. These tiers carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information. More information about the RaP system is available on the RaP homepage<http://tedlab.mit.edu/tedlab_website/RaPHome.html>.

Speech data are presented as FLAC-compressed 16-bit WAV files. The Boston data are single-channel 16 kHz files, while the CALLHOME data are either one- or two-channel 8 kHz files. Annotations are UTF-8 encoded Praat<http://www.fon.hum.uva.nl/praat/> TextGrids.

Rhythm and Pitch is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Phase 4 Arabic Broadcast News Speech<https://catalog.ldc.upenn.edu/LDC2018S05> was developed by LDC and comprises approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast News Transcripts (LDC2018T14<https://catalog.ldc.upenn.edu/LDC2018T14>).

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

This release contains 51 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
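For planning disk space, the stated audio parameters allow a rough estimate of the decompressed size. The following is a minimal sketch, not part of the release: the function name `pcm_bytes` is hypothetical, and the figure covers raw PCM samples only (no file headers), using the approximate 37-hour total given above.

```python
def pcm_bytes(duration_s, sample_rate_hz=16000, channels=1, bits_per_sample=16):
    """Size in bytes of uncompressed PCM audio (sample data only, no header)."""
    bytes_per_sample = bits_per_sample // 8
    return int(duration_s * sample_rate_hz * channels * bytes_per_sample)


# Approximately 37 hours of 16000 Hz, single-channel, 16-bit audio.
total = pcm_bytes(37 * 3600)
print(f"{total / 1e9:.2f} GB uncompressed")  # 4.26 GB uncompressed
```

The FLAC files themselves will be smaller than this; how much smaller depends on the recordings.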

GALE Phase 4 Arabic Broadcast News Speech is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) GALE Phase 4 Arabic Broadcast News Transcripts<https://catalog.ldc.upenn.edu/LDC2018T14> was developed by LDC and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast News Speech (LDC2018S05<https://catalog.ldc.upenn.edu/LDC2018S05>).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 204,735 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings.
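As an illustration of working with tab-delimited UTF-8 text of this kind, the sketch below counts whitespace-separated tokens in one column. It rests on assumptions that are not taken from the release: the function name `count_tokens` is hypothetical, the three-column sample is invented, and the position of the transcript column and the presence of a header row should be checked against the corpus documentation. A simple whitespace count will not necessarily reproduce the token total quoted above.

```python
import csv
import io


def count_tokens(lines, transcript_col, has_header=True):
    """Count whitespace-separated tokens in one column of tab-delimited text."""
    # QUOTE_NONE keeps any quote characters in the transcript as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    total = 0
    for row in reader:
        if len(row) > transcript_col:  # ignore short or empty rows
            total += len(row[transcript_col].split())
    return total


# Invented sample data; the column layout here is an assumption for illustration.
sample = io.StringIO(
    "file\tstart\ttranscript\n"
    "demo_1\t0.0\tمرحبا بكم\n"
    "demo_1\t2.5\tنشرة الأخبار المسائية\n"
)
print(count_tokens(sample, transcript_col=2))  # 5
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.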

GALE Phase 4 Arabic Broadcast News Transcripts is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Membership Office
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
