[Corpora-List] New from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Feb 26 22:49:46 CET 2008


LDC2008T04 *- OntoNotes Release 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04> -*

LDC2008T05 *- Penn Discourse Treebank Version 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05> -*

*- 2007 Member Survey Responses -*

*- 2008 Publications Pipeline -*

------------------------------------------------------------------------

* New Publications *

(1) The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology and coreference).

OntoNotes Release 1.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T21> contains 400k words of Chinese newswire data and 300k words of English newswire data. The current release, OntoNotes Release 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>, adds the following to the corpus: 274k words of Chinese broadcast news data and 200k words of English broadcast news data. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years. OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference.

(2) The Penn Discourse Treebank (PDTB) <http://www.seas.upenn.edu/%7Epdtb> Project is located at the Institute for Research in Cognitive Science at the University of Pennsylvania. The goal of the project is to develop a large-scale corpus annotated with information related to discourse structure. Penn Discourse Treebank Version 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05> contains annotations of discourse relations and their arguments on the one-million-word Wall Street Journal (WSJ) data in Treebank-2 (LDC95T7) <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7>.

The PDTB focuses on encoding discourse relations associated with discourse connectives, adopting a lexically grounded approach for the annotation. The corpus provides annotations for the argument structure of Explicit and Implicit connectives, the senses of connectives and the attribution of connectives and their arguments. The lexically grounded approach exposes a clearly defined level of discourse structure which will support the extraction of a range of inferences associated with discourse connectives.

The PDTB annotates semantic or informational relations holding between two (and only two) Abstract Objects (AOs), expressed either explicitly via lexical items or implicitly via adjacency. For the former, the lexical items anchoring the relation are annotated as Explicit connectives. For the latter, the implicit inferable relations are annotated by inserting an Implicit connective that best expresses the inferred relation.

Explicit connectives are identified from three grammatical classes: subordinating conjunctions (e.g., because, when), coordinating conjunctions (e.g., and, or), and discourse adverbials (e.g., however, otherwise). Arguments of connectives are simply labeled Arg2 for the argument appearing in the clause syntactically bound to the connective, and Arg1 for the other argument. In addition to the argument structure of discourse relations, the PDTB also annotates the attribution of relations (both explicit and implicit) as well as of each of their arguments.

The current release contains 40,600 discourse relation annotations, distributed across the following five types: Explicit Relations, Implicit Relations, Alternative Lexicalizations, Entity Relations, and No Relations.
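As an illustration only, the annotation scheme described above could be modeled in memory roughly as follows. This is a minimal sketch, not the PDTB file format or any official API: the class names, field names, and the example sentence are hypothetical, and the connective check covers only the two types for which the description above states that a connective is annotated or inserted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(Enum):
    """The five relation types named in the release description."""
    EXPLICIT = "Explicit Relation"
    IMPLICIT = "Implicit Relation"
    ALT_LEX = "Alternative Lexicalization"
    ENT_REL = "Entity Relation"
    NO_REL = "No Relation"


@dataclass
class DiscourseRelation:
    """Hypothetical record for one annotated relation (not the corpus format)."""
    relation_type: RelationType
    arg1: str  # the other argument
    arg2: str  # the argument syntactically bound to the connective
    connective: Optional[str] = None  # annotated (Explicit) or inserted (Implicit)
    sense: Optional[str] = None       # sense label, if one is recorded

    def __post_init__(self) -> None:
        # Assumption: only Explicit and Implicit relations require a connective.
        needs_connective = (RelationType.EXPLICIT, RelationType.IMPLICIT)
        if self.relation_type in needs_connective and not self.connective:
            raise ValueError(f"{self.relation_type.value} requires a connective")


# Invented example sentence (not taken from the corpus):
# "The game was cancelled because it rained."
example = DiscourseRelation(
    relation_type=RelationType.EXPLICIT,
    arg1="The game was cancelled",
    arg2="it rained",
    connective="because",
)
```

Here "because" is a subordinating conjunction, so the clause it introduces is Arg2 and the remaining clause is Arg1.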

* 2007 Member Survey Responses *

Please click here <https://secure.ldc.upenn.edu/intranet/surveyStatsPublic_2007.jsp?survey_id=1> to access a summary of the responses to Questions 1-15 of the 2007 Member Survey. These questions were sent to all survey recipients.

We also received many suggestions for future releases, among them:

* More African language publications

* Gigaword corpora in additional languages

* More annotated data for a greater variety of uses

* More parallel text corpora

* Web blogs and chat room data

Several corpora that would satisfy these needs are prospective 2008 publications.

The winner of the blind drawing for the $500 benefit for survey responses received by January 14, 2008 is Richard Rose of McGill University. Congratulations!

* 2008 Publications Pipeline *

Membership Year (MY) 2008 is shaping up to be another productive one for the LDC. We anticipate releasing a balanced and exciting selection of publications. Here is a glimpse of what is in the pipeline for MY2008. (Disclaimer: unforeseen circumstances may lead to modifications of our plans. Please regard this list as tentative.)

* BLLIP 1994-1997 News Text Release 1 - automatic parses for the North American News Text Corpus - NANT (LDC95T21). The parses were generated by the Charniak and Johnson Reranking Parser, which was trained on Wall Street Journal (WSJ) data from Treebank 3 (LDC99T42). Each file is a sequence of n-best lists containing the top n parses of each sentence with the corresponding parser probability and reranker score. The parses may be used in systems that are trained on labeled parse trees but require more data than is found in WSJ. Two versions will be released: a complete 'Members-Only' version, which contains parses for the entire NANT Corpus, and a 'Non Member' version for general licensing, which includes all news text except data from the Wall Street Journal.

* Chinese Proposition Bank - the goal of this project is to create a corpus of text annotated with information about basic semantic propositions. Predicate-argument relations are being added to the syntactic trees of the Chinese Treebank Data. This release contains the predicate-argument annotation of 81,009 verb instances (11,171 unique verbs) and 14,525 noun instances (1,421 unique nouns). The annotation of nouns is limited to nominalizations that have a corresponding verb.

* English Dictionary of the Tamil Verb - contains translations for 6,597 English verbs and defines 9,716 Tamil verbs. Each entry contains the following: the English entry or head word; the Tamil equivalent (in Tamil script and transliteration); the verb class and transitivity specification; the spoken Tamil pronunciation (audio files in mp3 format); the English definition(s); additional Tamil entries (if applicable); example sentences or phrases in Literary Tamil, Spoken Tamil (with a corresponding audio file) and an English translation; and Tamil synonyms or near-synonyms, where appropriate.

* GALE Phase 1 Arabic Blog Parallel Text - contains a total of 102K words (222 files) of Arabic blog text selected from 33 sources. Blogs consist of posts to informal web-based journals of varying topical content. Manual sentence units/segments (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Files were translated according to LDC's GALE Translation guidelines.

* GALE Phase 1 Chinese Blog Parallel Text - contains a total of 313K characters (277 files) of Chinese blog text selected from 8 sources. Blogs consist of posts to informal web-based journals of varying topical content. Manual sentence units/segments (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Files were translated according to LDC's GALE Translation guidelines.

* GALE Phase 1 Arabic Newsgroup Parallel Text - contains a total of 178K words (264 files) of Arabic newsgroup text selected from 35 sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. Manual sentence units/segments (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Files were translated according to LDC's GALE Translation guidelines.

* GALE Phase 1 Chinese Newsgroup Parallel Text - contains a total of 240K characters (112 files) of Chinese newsgroup text selected from 25 sources. Newsgroups consist of posts to electronic bulletin boards, Usenet newsgroups, discussion groups and similar forums. Manual sentence units/segments (SU) annotation was also performed on a subset of files following LDC's Quick Rich Transcription specification. Files were translated according to LDC's GALE Translation guidelines.

* Hindi WordNet - the first wordnet for an Indian language. Similar in design to the Princeton WordNet for English, it incorporates additional semantic relations to capture the complexities of Hindi. The WordNet contains 28,604 synsets and 63,436 unique words. Created by the NLP group at the Indian Institute of Technology Bombay, it is inspiring the construction of wordnets for many other Indian languages, notably Marathi.

* LCTL Bengali Language Pack - a set of linguistic resources to support technological improvement and development of new technology for the Bengali language, created in the Less Commonly Taught Languages (LCTL) project which covered a total of _ languages. Package components are: 2.6 million tokens of monolingual text, 500,000 tokens of parallel text, a bilingual lexicon with 48,000 entries, sentence and word segmenting tools, an encoding converter, a part-of-speech tagger, a morphological analyzer, a named entity tagger and 136,000 tokens of named-entity-tagged text, a Bengali-to-English name transliterator, and a descriptive grammar created by a PhD research linguist. About 30,000 tokens of the parallel text are English-to-LCTL translations of a "Common Subset" corpus, which will be included in all additional LCTL Language Packs.

* North American News Text Corpus (NANT) Reissue - as a companion to BLLIP 1994-1997 News Text Release 1, LDC will reissue the North American News Text Corpus (LDC95T21). Data includes news text articles from several sources (L.A. Times/Washington Post, Reuters General News, Reuters Financial News, Wall Street Journal, New York Times) that have been formatted with TIPSTER-style SGML tags to indicate article boundaries and organization of information within each article. Two versions will be released: a complete 'Members-Only' version, which contains all previously released NANT articles, and a 'Non Member' version for general licensing, which includes all news text except data from the Wall Street Journal.

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
