[Corpora-List] September 2021 Newsletter - LDC

Penn LDC ldc at ldc.upenn.edu
Wed Sep 15 17:50:13 CEST 2021


In this newsletter:

New Publications:
RATS Speaker Identification<https://catalog.ldc.upenn.edu/LDC2021S08>
Classical Arabic Dictionary<https://catalog.ldc.upenn.edu/LDC2021L01>
DiscAlign for Penn and RST Discourse Treebanks<https://catalog.ldc.upenn.edu/LDC2021T16>
________________________________

New publications:

(1) RATS Speaker Identification<https://catalog.ldc.upenn.edu/LDC2021S08> was developed by LDC and comprises approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto, and Urdu conversational telephone speech with annotations of speech segments. The audio was retransmitted over eight channels, for 17,000 hours of total speech. The corpus was created to provide training and development sets for the speaker identification task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC specifically for the RATS program from Levantine Arabic, Pashto, Urdu, Farsi, and Dari native speakers. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, speaker ID, speaker ID provenance, language ID, and language ID provenance.

RATS Speaker Identification is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Classical Arabic Dictionary<https://catalog.ldc.upenn.edu/LDC2021L01> consists of approximately one hundred million words of Arabic collected from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents, and related metadata.

The dictionary is presented in three formats: plain text in UTF-8 encoding, plain text in CP1256 encoding, and as an SQL database file. Source documents are presented in UTF-8 and CP1256 encodings.
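For readers who work with both text encodings, the following is a minimal Python sketch of converting CP1256 bytes to UTF-8. The sample word and the helper name to_utf8 are illustrative assumptions and are not taken from the corpus or its documentation; the actual file layout of the dictionary is not described here.

```python
# Hypothetical sample word ("book"); not taken from the corpus.
sample = "كتاب"

# The same text as bytes in the two encodings named above.
utf8_bytes = sample.encode("utf-8")      # two bytes per Arabic letter
cp1256_bytes = sample.encode("cp1256")   # one byte per Arabic letter


def to_utf8(raw: bytes, source_encoding: str = "cp1256") -> bytes:
    """Re-encode raw bytes from source_encoding to UTF-8.

    Illustrative helper (an assumption, not part of any LDC tooling).
    """
    return raw.decode(source_encoding).encode("utf-8")


converted = to_utf8(cp1256_bytes)
print(converted == utf8_bytes)
```

Decoding with the wrong codec generally yields unreadable text or raises an error, so the encoding should be stated explicitly when opening either version of a file.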

Classical Arabic Dictionary is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) DiscAlign for Penn and RST Discourse Treebanks<https://catalog.ldc.upenn.edu/LDC2021T16> was developed by Saarland University. It consists of alignment information for the discourse annotations contained in Penn Discourse Treebank Version 2.0 (LDC2008T05)<https://catalog.ldc.upenn.edu/LDC2008T05> (PDTB 2.0) and RST Discourse Treebank (LDC2002T07)<https://catalog.ldc.upenn.edu/LDC2002T07> (RST-DT). PDTB 2.0 and RST-DT annotations overlap for 385 newspaper articles in sections 6, 11, 13, 19 and 23 of the Wall Street Journal corpus contained in Treebank-2 (LDC95T7)<https://catalog.ldc.upenn.edu/LDC95T7>. DiscAlign for Penn and RST Discourse Treebanks contains approximately 6,700 alignments between PDTB 2.0 and RST-DT relations.

DiscAlign for Penn and RST Discourse Treebanks is available at no cost to all licensees of PDTB 2.0 and RST-DT and appears in the download queues associated with these corpora as DiscAlign_Penn_RST_DTB_LDC2021T16.zip.

*

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
