[Corpora-List] News from LDC - July 2013

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Jul 23 22:37:53 CEST 2013


*- Fall 2013 Data Scholarship Program -*

/New publications:/

*- Chinese Proposition Bank 3.0 -*

*- GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 -*

------------------------------------------------------------------------

*Fall 2013 Data Scholarship Program*

Applications are now being accepted through September 16, 2013, 11:59PM EST for the Fall 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) /Data Use Proposal/. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project, and should describe the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog <http://www.ldc.upenn.edu/Catalog/index.jsp> for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select at most two databases.

(2) /Letter of Support/. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, please visit the LDC Data Scholarship <http://www.ldc.upenn.edu/About/scholarships.html> page.

Students can email their applications to the LDC Data Scholarship program <mailto:datascholarships at ldc.upenn.edu>. Decisions will be sent by email from the same address.

The deadline for the Fall 2013 program is Monday, September 16, 2013, 11:59PM EST.

*New publications*

(1) Chinese Proposition Bank 3.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T13> is a continuation of the Chinese Proposition Bank <http://www.cs.brandeis.edu/%7Eclp/ctb/cpb/> project, which aims to create a corpus of text annotated with information about basic semantic propositions. Chinese Proposition Bank 3.0 adds predicate-argument annotation on 187,731 words from Chinese Treebank 7.0 (LDC2010T07 <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T07>). The data sources comprise newswire, magazine articles, various broadcast news and broadcast conversation programming, web newsgroups and weblogs.

LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23 <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T23>) and Chinese Proposition Bank 2.0 (LDC2008T07 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T07>).

This release contains the predicate-argument annotation of 173,206 verb instances and 14,525 noun instances. The annotation of nouns is limited to nominalizations that have a corresponding verb. The general annotation guidelines and the lexical guidelines (called frame files) for each verbal and nominal predicate are also included in this release. Below are some statistics about the corpus.

* Total propositions for verbs - 173,206

* Total propositions for nouns - 14,525

* Total verbs framed - 24,642

* Total framesets - 26,467

* Verbs with multiple framesets - 1,337

* Average framesets per verb - 1.07

* Total nouns framed - 1,421

* Total noun framesets - 1,528

* Nouns with multiple framesets - 48

* Average framesets per noun - 1.08
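The figures above are internally consistent, which can be checked with a short calculation. The sketch below assumes that each average is simply the number of framesets divided by the number of framed predicates, and that the 187,731 annotated words correspond to the verb and noun propositions combined; the variable names are illustrative and are not part of the release.

```python
# Counts copied from the corpus statistics listed above.
verb_props, noun_props = 173_206, 14_525
verbs_framed, verb_framesets = 24_642, 26_467
nouns_framed, noun_framesets = 1_421, 1_528

# Assumption: total annotated instances = verb + noun propositions.
total_props = verb_props + noun_props

# Assumption: average = framesets / framed predicates, rounded to 2 places.
avg_verb = round(verb_framesets / verbs_framed, 2)
avg_noun = round(noun_framesets / nouns_framed, 2)

print(total_props, avg_verb, avg_noun)
```

Under these assumptions the results agree with the reported total of 187,731 and the reported averages of 1.07 and 1.08.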

(2) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T14> was developed by LDC and contains 115,826 tokens of word-aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction, and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance through enhanced syntactic parsers, better rules and knowledge about language pairs, and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2012T07>).

The source data consists of Arabic broadcast news programming collected by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai TV. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files   Words    Tokens    Segments
Arabic     28      89,213   115,826   4,824

Note: Word count is based on the untokenized Arabic source. Token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:

* Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)

* Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages

* Tagging unmatched words attached to other words or phrases

------------------------------------------------------------------------

--

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium     Phone: 1 (215) 573-1275
University of Pennsylvania     Fax: 1 (215) 573-2175
3600 Market St., Suite 810     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA     http://www.ldc.upenn.edu
