[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Apr 30 18:43:03 CEST 2007

*The Linguistic Data Consortium (LDC) would like to report on recent
developments and announce the availability of two new publications.*

* LDC Celebrates its Fifteenth Anniversary!

* Free Google Data (Web 1T 5-gram) Available

* ISI Chinese-English Automatically Extracted Parallel Text

* TRECVID 2003 Keyframes & Transcripts

*LDC Celebrates its Fifteenth Anniversary!*

April 15, 2007 marked the start of the LDC's 15th Anniversary year! We
have many milestones to celebrate this year, including the growth of our
staff to more than 40 full-time employees and an online catalog that
includes over 350 linguistic databases. Since 1992, no fewer than 2,300
organizations from over 80 different nations have licensed LDC data.
This data has been made available through donations, funded projects at
LDC or elsewhere, community initiatives, and, increasingly, LDC
initiatives. Over the past fifteen years, the LDC has grown from an
organization that shares existing language technology resources into one
that is also at the forefront of creating new data resources, software
tools, and standards.

As we celebrate throughout the year, look for new membership offerings
and announcements. And be sure to join us as we count down to the much
anticipated distribution of our 50,000th publication.

*Free Google Data Available*

The LDC is pleased to announce that Google Inc. is providing financial
support for the distribution of its Web 1T 5-gram (LDC2006T13) corpus to
universities. As a result, LDC will make the corpus available at no
charge to 50 non-member universities requesting a copy. Shipping and
handling fees are also being covered by Google. Note that quantities are
limited and the Web 1T 5-gram data is a popular publication. We
appreciate Google's generosity and its interest in supporting language
research. To obtain a free copy, universities will need to sign and
submit a copy of the User License Agreement for Web 1T 5-gram Version.
Please email ldc at ldc.upenn.edu with your contact information.

*New Publications*

(1) ISI Chinese-English Automatically Extracted Parallel Text consists
of Chinese-English parallel sentences extracted automatically from two
monolingual corpora: Chinese Gigaword Second Edition (LDC2006T02) and
English Gigaword Second Edition (LDC2005T12). The data was extracted
from news articles published by Xinhua News Agency.

The corpus contains 558,567 sentence pairs; the word count on the
English side is approximately 16M words. The sentences in the parallel
corpus preserve the form and encoding of the texts in the original
Gigaword corpora.

For each sentence pair in the corpus, the authors provide the names of
the documents from which the two sentences were extracted, as well as a
confidence score (between 0.5 and 1.0) indicating their degree of
parallelism. The parallel sentence identification approach is designed
to judge sentence pairs in isolation from their contexts and can
therefore find parallel sentences within document pairs that are not
parallel. Conversely, the fact that two documents share several parallel
sentences does not necessarily mean that the documents are parallel.
In order to make this resource useful for research in Machine
Translation (MT), the authors made efforts to detect potential overlaps
between this data and the standard test and development data sets used
by the MT community.
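As a rough illustration of how the per-pair confidence scores might be
used, the sketch below keeps only sentence pairs at or above a score
threshold. The record layout, field order, and sample document IDs here
are invented for the example; they are not the corpus's actual file
format, which is documented in the release itself.

```python
# Hypothetical filter over extracted sentence pairs.
# Assumed record layout (NOT the actual corpus format):
#   chinese_doc \t english_doc \t confidence \t chinese_sentence \t english_sentence

def filter_pairs(records, min_confidence=0.8):
    """Keep only sentence pairs whose confidence meets the threshold."""
    kept = []
    for line in records:
        zh_doc, en_doc, score, zh_sent, en_sent = line.rstrip("\n").split("\t")
        if float(score) >= min_confidence:
            kept.append((zh_sent, en_sent, float(score)))
    return kept

# Two invented example records: one high-confidence, one low-confidence.
sample = [
    "XIN20040101.0001\tXIN_ENG20040101.0001\t0.95\t...\tChina's economy grew rapidly.",
    "XIN20040102.0007\tXIN_ENG20040103.0002\t0.55\t...\tA loosely related sentence.",
]
high_confidence = filter_pairs(sample, min_confidence=0.8)
```

Raising the threshold trades coverage for precision, which is the usual
knob when such extracted data is fed into MT training.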


(2) TREC Video Retrieval Evaluation (TRECVID) is sponsored by the
National Institute of Standards and Technology (NIST) to promote
progress in content-based retrieval from digital video via open,
metrics-based evaluation. The keyframes in TRECVID 2003 Keyframes &
Transcripts were extracted for use in the NIST TRECVID 2003 Evaluation.
The source data consisted of English-language broadcast programming
collected by LDC in 1998 from ABC ("World News Tonight") and CNN ("CNN
Headline News").

TRECVID is a laboratory-style evaluation that attempts to model real
world situations or significant component tasks involved in such
situations. In 2003 there were four main tasks with associated tests:

* shot boundary determination

* story segmentation

* high-level feature extraction

* search (interactive and manual)

Shots are fundamental units of video, useful for higher-level
processing. To create the master list of shots, the video was first
segmented; the results of this pass are called subshots. Because the
master shot reference is designed for use in manual assessment, a second
pass over the segmentation was made to create master shots of at least 2
seconds in length. These master shots are the ones used in submitting
results for the feature and search tasks in the evaluation. In the
second pass, starting at the beginning of each file, the subshots were
aggregated, if necessary, until the current shot was at least 2 seconds
in duration, at which point the aggregation began anew with the next
subshot.
The keyframes were selected by going to the middle frame of the shot,
then searching left and right of that frame to locate the nearest
I-frame, which became the keyframe and was extracted. Keyframes are
provided at both the subshot (NRKF) and master shot (RKF) levels.
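That selection rule admits a very small sketch. The frame-number
representation of shot boundaries and the precomputed list of I-frame
positions are assumptions made for this example:

```python
def select_keyframe(shot_start, shot_end, i_frames):
    """Pick the I-frame nearest the middle frame of a shot.

    A sketch of the keyframe selection rule described above; shot
    boundaries and I-frame positions are given as frame numbers."""
    middle = (shot_start + shot_end) // 2
    # Searching left and right for the nearest I-frame is equivalent to
    # minimizing the absolute distance to the middle frame.
    return min(i_frames, key=lambda f: abs(f - middle))

# Example: for a shot spanning frames 100-200, the middle frame is 150,
# and the nearest I-frame in the (invented) list is 148.
keyframe = select_keyframe(100, 200, [90, 148, 165, 210])
```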


Ilya Ahtaridis
Membership Coordinator

Linguistic Data Consortium      Phone: (215) 573-1275
University of Pennsylvania      Fax: (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104          http://www.ldc.upenn.edu
