[Corpora-List] January 2019 Newsletter - LDC

Penn LDC ldc at ldc.upenn.edu
Tue Jan 15 16:55:48 CET 2019


January 2019 Newsletter

In this newsletter:

Renew Your LDC Membership Today

New publications:

BOLT Arabic Discussion Forum Parallel Training Data<https://catalog.ldc.upenn.edu/LDC2019T01>

SRI Speech-Based Collaborative Learning Corpus<https://catalog.ldc.upenn.edu/LDC2019S01>

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015<https://catalog.ldc.upenn.edu/LDC2019T02>

Renew Your LDC Membership Today

Join LDC while membership savings are still available. Now through March 1, 2019, all organizations receive a discount on the 2019 membership fee (up to 10%) when they choose to join the Consortium or renew their membership. This year's planned publications include Multilanguage Conversational Telephone Speech (telephone speech in languages/dialects considered mutually intelligible or closely related), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), Chinese Abstract Meaning Representation Corpus, SRI Speech-Based Collaborative Learning Corpus, data from BOLT, HAVIC, DEFT, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership options and benefits.

New publications:

(1) BOLT Arabic Discussion Forum Parallel Training Data<https://catalog.ldc.upenn.edu/LDC2019T01> was developed by LDC and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations.

LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The full source data collection is released as BOLT Arabic Discussion Forums (LDC2018T10<https://catalog.ldc.upenn.edu/LDC2018T10>).

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were then segmented into sentence units, formatted into a human-readable translation format, and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

BOLT Arabic Discussion Forum Parallel Training Data is available as a web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) SRI Speech-Based Collaborative Learning Corpus<https://catalog.ldc.upenn.edu/LDC2019S01> was developed by SRI International<https://www.sri.com/> and comprises approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of collaboration, log files, and supporting documentation.

This collection was part of a project investigating the utility of a speech-based learning analytics approach to collaborative learning. The goal was to determine whether detectable patterns exist in student speech that correlate with collaborative learning indicators and to provide a means of assessing collaboration quality. The participants were students in middle schools (grades six, seven, and eight) located in California. Students worked in groups of three on sets of short mathematics problems based on the "cloze" task: each student was assigned one blank, and each problem required the students to work together and talk to each other to coordinate their three answers. The problems were presented on iPads with a custom software application, and the audio data was captured by both head-mounted and table-top microphones.

SRI Speech-Based Collaborative Learning Corpus is available as a web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015<https://catalog.ldc.upenn.edu/LDC2019T02> was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015. It includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. Also included in this data set are all necessary source documents as well as BaseKB - the second reference KB that was adopted for use by EDL in 2015. The first EDL reference KB to which 2014 EDL data are linked is available separately as TAC KBP Reference Knowledge Base (LDC2014T16<https://catalog.ldc.upenn.edu/LDC2014T16>).

The goal of the EDL track is to conduct end-to-end entity extraction, linking, and clustering. To produce gold standard data for a given document collection, annotators (1) extract (identify and classify) entity mentions (queries) and link them to nodes in a reference KB, and (2) perform cross-document co-reference on within-document entity clusters that cannot be linked to the KB.
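As a rough illustration of the two annotation outcomes described above, the sketch below models a mention that is either linked to a reference KB node or placed in a cross-document NIL cluster. The class name, field names, entity type labels, and identifiers such as "KB:0001" and "NIL0001" are hypothetical and are not taken from the corpus documentation; the actual file formats in LDC2019T02 may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Mention:
    """One annotated entity mention (query). All field names are illustrative."""
    doc_id: str
    text: str
    entity_type: str                   # e.g. "PER", "ORG" (assumed labels)
    kb_id: Optional[str] = None        # reference KB node, if the mention is linkable
    nil_cluster: Optional[str] = None  # cross-document cluster for unlinkable mentions


def group_by_entity(mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group mentions by KB node when linked, otherwise by NIL cluster."""
    groups: Dict[str, List[Mention]] = defaultdict(list)
    for m in mentions:
        key = m.kb_id if m.kb_id is not None else m.nil_cluster
        if key is None:
            raise ValueError(f"mention {m.text!r} has neither a KB link nor a NIL cluster")
        groups[key].append(m)
    return dict(groups)


# Hypothetical sample data: two linked mentions and two NIL mentions.
mentions = [
    Mention("doc1", "Penn", "ORG", kb_id="KB:0001"),
    Mention("doc2", "University of Pennsylvania", "ORG", kb_id="KB:0001"),
    Mention("doc1", "J. Doe", "PER", nil_cluster="NIL0001"),
    Mention("doc3", "Jane Doe", "PER", nil_cluster="NIL0001"),
]
groups = group_by_entity(mentions)
```

Here the two NIL mentions come from different documents but share one cluster, which is the cross-document co-reference step applied to entities absent from the KB.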

Source data consists of Chinese, English, and Spanish newswire and web text collected by LDC. The EDL 2014 task involved English data only. Chinese and Spanish data were added in the 2015 task.

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 is available as a web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
