[Corpora-List] January 2020 Newsletter - LDC

Penn LDC ldc at ldc.upenn.edu
Wed Jan 15 18:23:18 CET 2020


In this newsletter:

Renew Your LDC Membership Today
LREC Workshop for Citizen Linguistics - Call for Papers

New Publications:

Abstract Meaning Representation (AMR) Annotation Release 3.0<https://catalog.ldc.upenn.edu/LDC2020T02>
Database of Word Level Statistics - Mandarin<https://catalog.ldc.upenn.edu/LDC2020L01>
LibriVox Spanish<https://catalog.ldc.upenn.edu/LDC2020S01>

Renew Your LDC Membership Today

Join LDC for MY2020 while membership savings are still available. Now through March 2, 2020, renewing MY2019 members receive a 10% discount off the 2020 membership fee. New or returning member organizations receive a 5% discount. This year's planned publications include Mixer 4 and 5 Speech (English telephone speech and interviews), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), and data from BOLT, DEFT, RATS, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership options and benefits.

LREC Workshop on Citizen Linguistics

LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development<https://sites.google.com/view/cllrd-2020> at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org<http://LanguageARC.org> citizen linguistics portal, and a special session on best papers using LanguageARC.

New publications:

(1) Abstract Meaning Representation (AMR) Annotation Release 3.0<https://catalog.ldc.upenn.edu/LDC2020T02> was developed by LDC, SDL/Language Weaver, Inc.<https://www.sdl.com/software-and-services/translation-software/machine-translation/>, the University of Colorado's Computational Language and Educational Research<https://www.colorado.edu/lab/clear/> group, and the Information Sciences Institute<http://www.isi.edu/home> at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction, and web text. This release updates Abstract Meaning Representation 2.0 (LDC2017T10<https://catalog.ldc.upenn.edu/LDC2017T10>) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
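
As a rough illustration of the "who is doing what to whom" idea, the sketch below stores the AMR for "The boy wants to go" (the standard example from the AMR guidelines, not a sentence taken from this release) as plain triples. The variable names, the dictionary layout, and the helper functions are assumptions made for this example and are not part of the LDC distribution or any AMR toolkit.

```python
# Illustrative sketch only: the AMR for "The boy wants to go" as triples.
# PENMAN form: (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
# INSTANCES, EDGES and both helpers are hypothetical names for this example.

# Each variable is an instance of a concept (PropBank frame or plain word).
INSTANCES = {"w": "want-01", "b": "boy", "g": "go-01"}

# Each edge is (source variable, role, target variable).
EDGES = [
    ("w", "ARG0", "b"),  # the boy is the one who wants
    ("w", "ARG1", "g"),  # what is wanted is the going
    ("g", "ARG0", "b"),  # the boy is also the one who goes
]


def arg0_pairs(instances, edges):
    """Return (ARG0 filler concept, predicate concept) for every :ARG0 edge."""
    return [
        (instances[target], instances[source])
        for source, role, target in edges
        if role == "ARG0"
    ]


def reentrant_variables(edges):
    """Return variables with more than one incoming edge, sorted by name."""
    incoming = {}
    for _, _, target in edges:
        incoming[target] = incoming.get(target, 0) + 1
    return sorted(var for var, count in incoming.items() if count > 1)


if __name__ == "__main__":
    print(arg0_pairs(INSTANCES, EDGES))
    print(reentrant_variables(EDGES))
```

The variable b is the target of two edges, which is how a single participant can fill roles in both want-01 and go-01 without being repeated in the annotation.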

Abstract Meaning Representation (AMR) Annotation Release 3.0 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. *

(2) Database of Word Level Statistics - Mandarin<https://catalog.ldc.upenn.edu/LDC2020L01> was developed by The Hong Kong Polytechnic University<https://www.polyu.edu.hk/web/en/home/index.html>. It provides descriptive and statistical lexical characteristics for words and nonwords of Mandarin Chinese. It is designed for researchers particularly concerned with language processing of isolated words. Invariant characteristics include each item's lexicality, SAMPA, pinyin, IPA transcription, lexical tone, syllable structure, syllable length, pinyin length, segment length, dominant PoS, lexical frequency of the dominant PoS, percent of that dominant PoS, and other PoSes associated with the given item.

Database of Word Level Statistics - Mandarin is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. *

(3) LibriVox Spanish<https://catalog.ldc.upenn.edu/LDC2020S01> consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox<https://librivox.org/>, a non-profit project that creates audiobooks from public domain works. The transcripts were developed for this release.

The audio consists of sentences from 300 books read by 154 speakers (77 men and 77 women), representing native and non-native Spanish read speech. Audio files were manually segmented and are between three and ten seconds in length. Native Spanish speakers transcribed the audio data.

LibriVox Spanish is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. *

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu<mailto:ldc at ldc.upenn.edu>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
