[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Fri May 21 19:16:21 CEST 2010

In this newsletter:

- Coming Soon: LDC Data Scholarship Program!

New publications:

- LDC2010S03 - 2003 NIST Speaker Recognition Evaluation
- LDC2010T09 - ACE 2005 Mandarin SpatialML Annotations
- LDC2010T10 - NIST 2002 Open Machine Translation (OpenMT) Evaluation

*Coming Soon: LDC Data Scholarship Program!*

We are pleased to announce that the LDC Data Scholarship program is in the works! This program will provide university students with access to LDC data at no cost. Each year LDC distributes thousands of dollars' worth of data at no cost or reduced cost to students who demonstrate a need for data yet cannot secure funding. LDC will formalize this practice through the newly created LDC Data Scholarship program.

Data scholarships will be offered each semester beginning with the fall 2010 semester (September - December 2010). Students will need to complete an application, which will include a data use proposal and letter of support from their faculty adviser. We anticipate that the selection process will be highly competitive.

Stay tuned for further announcements in our newsletter and on our home page!


*New Publications*


(1) 2003 NIST Speaker Recognition Evaluation <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010S03> was developed by researchers at NIST (National Institute of Standards and Technology). It consists of just over 120 hours of English conversational telephone speech used as training data and test data in the 2003 Speaker Recognition Evaluation (SRE), along with evaluation metadata and test set answer keys.

2003 NIST Speaker Recognition Evaluation is part of an ongoing series of yearly evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluation was designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.

This speaker recognition evaluation focused on the tasks of 1-speaker and 2-speaker detection in the context of conversational telephone speech. The original evaluation consisted of three parts: 1-speaker detection "limited data", 2-speaker detection "limited data", and 1-speaker detection "extended data". This corpus contains training and test data and supporting metadata (including answer keys) for only the 1-speaker "limited data" and 2-speaker "limited data" components of the original evaluation. The 1-speaker "extended data" component of the original evaluation (not included in this corpus) provided metadata only, to be used in conjunction with data from Switchboard-2 Phase II (LDC99S79) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S79> and Switchboard-2 Phase III Audio (LDC2002S06) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S06>. The metadata (resources and answer keys) for the 1-speaker "extended data" component of the original 2003 SRE are available from the NIST Speech Group website for the 2003 Speaker Recognition Evaluation <http://www.itl.nist.gov/iad/mig/tests/sre/2003/index.html>.

The data in this corpus is a 120-hour subset of data first made available to the public as Switchboard Cellular Part 2 Audio (LDC2004S07) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S07>, reorganized specifically for use in the 2003 NIST SRE.



(2) ACE 2005 Mandarin SpatialML Annotations <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T09> was developed by researchers at The MITRE Corporation <http://www.mitre.org/> (MITRE). ACE 2005 Mandarin SpatialML Annotations applies SpatialML tags to a subset of the source Mandarin training data in ACE 2005 Multilingual Training Corpus (LDC2006T06).

SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML focuses on geography and culturally relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases, and mapping services.

The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2 <http://fofoca.mitre.org/>, TimeML <http://www.timeml.org/site/index.html>, and the 2005 ACE guidelines <http://www.itl.nist.gov/iad/mig/tests/ace/2005/doc/ace05eval_official_results_20060110.html>. The main SpatialML tag is the PLACE tag, which encodes information about location. The central goal of SpatialML is to map location information in text to data from gazetteers and other databases, to the extent possible, by defining attributes in the PLACE tag. Semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are therefore used to help establish such a mapping. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program.
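
To make the role of PLACE attributes concrete, here is a minimal Python sketch that reads PLACE elements from an inline-tagged fragment and collects their attributes for a later gazetteer lookup. The sample fragment, the attribute names (id, type, country, state, latLong), their values, and the function name extract_places are all illustrative assumptions; the authoritative attribute inventory is in the version 2.3 annotation guidelines shipped with the release.

```python
import xml.etree.ElementTree as ET

# Hypothetical SpatialML-style fragment. Attribute names and values are
# illustrative assumptions, not copied from the corpus or the guidelines.
SAMPLE = (
    '<DOC>Flights from '
    '<PLACE id="p1" type="PPL" country="US" state="PA" '
    'latLong="39.95 -75.17">Philadelphia</PLACE>'
    ' to '
    '<PLACE id="p2" type="COUNTRY" country="CN">China</PLACE>'
    ' resumed.</DOC>'
)


def extract_places(xml_text):
    """Return one dict per PLACE element: its surface text plus its attributes."""
    root = ET.fromstring(xml_text)
    return [
        {"text": place.text, **place.attrib}
        for place in root.iter("PLACE")
    ]


places = extract_places(SAMPLE)
for place in places:
    print(place)
```

Each resulting dictionary carries the kind of information (country code, subdivision, coordinates) that could serve as a key into a gazetteer or mapping service.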

This corpus consists of a 298-document subset of broadcast material from the ACE 2005 Multilingual Training Corpus (LDC2006T06) that has been tagged by a native Mandarin speaker according to version 2.3 of the SpatialML annotation guidelines, which are included in the documentation for this release.



(3) NIST 2002 Open Machine Translation (OpenMT) Evaluation <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T10> is a package containing source data, reference translations, and scoring software used in the NIST 2002 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. Researchers at NIST compiled the package and developed the scoring software, making use of newswire source data and reference translations collected and developed by LDC.

The objective of the NIST OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. Additional information about these evaluations may be found at the NIST Open Machine Translation (OpenMT) Evaluation web site <http://www.itl.nist.gov/iad/mig/tests/mt/>.

This evaluation kit includes a single Perl script that may be used to produce a translation quality score for one or more MT systems. The script works by comparing the system output translation with a set of expert reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
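
The word-sequence matching described above can be illustrated with a short Python sketch. This is a simplified illustration of the general idea only, not the algorithm of the Perl script in the kit: it counts how many word n-grams in a system output also occur in at least one reference, capping each n-gram's credit at its maximum count in any single reference. The function names ngrams and clipped_matches, and the whitespace tokenization, are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    An n-gram in the hypothesis is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)

    # For each n-gram, record its highest count across the references.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total


unigram_result = clipped_matches("the the cat sat", ["the cat sat down"], 1)
bigram_result = clipped_matches("the the cat sat", ["the cat sat down"], 2)
print(unigram_result, bigram_result)
```

Dividing matched by total gives a per-order precision; a real scoring tool would combine several n-gram orders and handle tokenization and length effects, which this sketch leaves out.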

The Chinese-language source text included in this corpus is a reorganization of data that was initially released to the public as Multiple-Translation Chinese (MTC) Part 2 (LDC2003T17) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T17>. The Chinese-language reference translations are a reorganized subset of data from the same MTC corpus. The Arabic-language data (source text and reference translations) is a reorganized subset of data that was initially released to the public as Multiple-Translation Arabic (MTA) Part 1 (LDC2003T18) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T18>. All source data for this corpus is newswire text.

For each language, the test set consists of two files: a source file and a reference file. Each reference file contains four independent translations of the data set. The evaluation year, source language, test set, version of the data, and source vs. reference file are reflected in the file name.



Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
