[Corpora-List] December 2019 Newsletter - LDC

Penn LDC ldc at ldc.upenn.edu
Thu Dec 5 21:55:33 CET 2019


In this newsletter:
LDC Membership Discounts for MY2020 Still Available
Spring 2020 Data Scholarship Program - deadline approaching
Introducing LanguageArc: A Citizen Linguist Portal

New Publications:
Magic Data Chinese Mandarin Conversational Speech<https://catalog.ldc.upenn.edu/LDC2019S23>
BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training<https://catalog.ldc.upenn.edu/LDC2019T18>
TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017<https://catalog.ldc.upenn.edu/LDC2019T19>
________________________________

LDC Membership Discounts for MY2020 Still Available

Join LDC while membership savings are still available. Now through March 2, 2020, current MY2019 members who renew their LDC membership receive a 10% discount off the membership fee. New or returning member organizations receive a 5% discount through March 2. Membership remains the most economical way to access LDC releases. Visit Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership options and benefits.

Spring 2020 Data Scholarship Program - deadline approaching

Students can apply for the Spring 2020 Data Scholarship Program now through January 15, 2020. The LDC Data Scholarship program provides students with no-cost access to LDC data. For more information on application requirements and program rules, please visit LDC Data Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

Introducing LanguageArc: A Citizen Linguist Portal

LanguageARC<https://languagearc.com> is a citizen science website for languages developed with a grant from the National Science Foundation (no. 170377). Contributors to this online community - "citizen linguists" - participate in a variety of tasks and activities that support linguistic research, such as identifying accents from audio clips, recording "tongue twisters," and translating English sentences into other languages. Data collected from LanguageArc will be made freely available to the research community. New collection and annotation projects will be added on an ongoing basis, and researchers will soon be able to create their own LanguageArc projects with an easy-to-use Project Builder Toolkit. All are encouraged to explore the site and participate in the community. Comments, questions and suggestions are welcome via the site's Contact<https://www.languagearc.com/messages/new> page.
________________________________

New publications:

(1) Magic Data Chinese Mandarin Conversational Speech<https://catalog.ldc.upenn.edu/LDC2019S23> was developed by Beijing Magic Data Technology Co., Ltd.<http://en.imagicdatatech.com/> and consists of approximately 10 hours of Mandarin conversational speech from 60 speakers. Each conversation was recorded on multiple devices and is presented in multiple forms, resulting in a total of approximately 60 hours of audio with corresponding transcripts.

All participants were native speakers of Mandarin in Mainland China from accent regions across the country. Speakers were paired for conversations on a range of topics, including travel, fitness, games, sports, and pets. Metadata such as topic, collection date, mobile device, and speaker demographic information is available in the documentation accompanying this release.

Magic Data Chinese Mandarin Conversational Speech is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training<https://catalog.ldc.upenn.edu/LDC2019T18> was developed by LDC and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

This release contains Egyptian Arabic source text message and chat conversations collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants. The source data is released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07<https://catalog.ldc.upenn.edu/LDC2017T07>).

The BOLT word alignment task was built on treebank annotation. Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC's BOLT Egyptian Arabic Treebank, which had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017<https://catalog.ldc.upenn.edu/LDC2019T19> was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2016<http://tac.nist.gov/2016/KBP/index.html> and 2017<http://tac.nist.gov/2017/KBP/index.html>. This corpus includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. The EDL reference KB, to which EDL data are linked, is available separately in TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 (LDC2019T02<https://catalog.ldc.upenn.edu/LDC2019T02>).

The goal of the EDL track is to conduct end-to-end entity extraction, linking and clustering. To produce gold standard data for a given document collection, annotators (1) extract (identify and classify) entity mentions (queries) and link them to nodes in a reference KB, and (2) perform cross-document co-reference on within-document entity clusters that cannot be linked to the KB.

Source data for the annotations consists of Chinese, English and Spanish newswire and discussion forum text collected by LDC and is available in TAC KBP Evaluation Source Corpora 2016-2017 (LDC2019T12<https://catalog.ldc.upenn.edu/LDC2019T12>).

TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Membership Office
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu<mailto:ldc at ldc.upenn.edu>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
