[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Oct 27 20:50:27 CET 2008

- *Programmer Analyst Position at LDC* -

LDC2008T22 - *Czech Academic Corpus 2.0* <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22> -

LDC2008T19 - *The New York Times Annotated Corpus* <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19> -

The Linguistic Data Consortium (LDC) would like to announce a programmer analyst opening and the availability of two new publications.



*Programmer Analyst Position at LDC*

The Linguistic Data Consortium (LDC) at the University of Pennsylvania, Philadelphia, PA has an immediate opening for a full-time programmer analyst.

Programmer Analyst -- Publications Programmer (#081025790)

Duties: The position has primary responsibility for developing, implementing and managing the data processing systems required to coordinate and prepare publications of language resources used for human language technology research and technology development. Such resources include video, computer-readable speech, software and text data that are distributed via media and the internet. The programmer will communicate with external data providers and internal project managers to acquire raw source material and to schedule releases; perform quality assessment of large data collections and render analyses/descriptions of their formats; create or adapt software tools to condition data to a uniform format and level of quality (e.g., eliminating corrupted data and normalizing data); apply quality control standards to published data and verify results; document initial and final data formats; review author documentation and supporting materials; create additional documentation as needed; and master and replicate publications. The programmer will also maintain the publications catalog system, the publications inventory, the archive of publishable and published data, and the publication equipment, software and licenses. The position requires attention to detail and involves managing multiple short-term projects.

For further information on the duties and qualifications for this position, or to apply online, please visit http://jobs.hr.upenn.edu/ and search postings for the reference number indicated above.

Penn offers an excellent benefits package including medical/dental, retirement plans, tuition assistance and a minimum of 3 weeks paid vacation per year. The University of Pennsylvania is an affirmative action/equal opportunity employer.

Position contingent upon funding. For more information about LDC and the programs we support, visit http://www.ldc.upenn.edu/.

*New Publications*

(1) The Prague family of annotated corpora has a new member, the Czech Academic Corpus 2.0 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22> (CAC 2.0). CAC 2.0 consists of 650,000 words from various 1970s and 1980s newspapers, magazines and radio and television broadcast transcripts manually annotated for morphology and syntax.

The CAC 2.0 offers:

* For linguists: language material reflecting the real usage of the


* For computational linguists: tools and a considerable amount of data for natural language applications that are not feasible without morphological and syntactical text processing.

* For TrEd annotation tool users: the possibility to use voice control for the tool.

* For teachers and their students: an interesting didactic tool for practicing Czech language morphology and syntax.

CAC 2.0 was created by a team from the Institute of the Czech Language, the Academy of Sciences of the Czech Republic. The original purpose of the corpus was to build a frequency dictionary of the Czech language. Researchers were aware, however, that in order to make the CAC useful for future users, whether linguists or natural language processing systems developers, it was necessary to design annotation schemes and to develop tools that would add as much linguistic information as possible to the data. In 1996, the Prague Dependency Treebank (PDT) <http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/whatis.html>, which provided morphological and syntactic analytic layers of annotation to certain Czech media data, was launched independently of the CAC. During the work on the PDT's second version <http://ufal.mff.cuni.cz/pdt2.0/>, its researchers decided to transfer PDT's internal format and annotation scheme to the CAC with the goals of making the CAC and the PDT fully compatible and of integrating the CAC into the PDT. To that end, the CAC was manually annotated for morphology and syntax. CAC 2.0 adds the surface syntax annotation; in the terminology of the PDT, this annotation is called an analytical layer.

A morphological layer of annotation attaches further data to each word token: its lemma (the canonical form of a lexeme), its part of speech, and its morphological categories (case, number, tense, person, etc.). Formally, a part-of-speech class combines with the values of the morphological categories to form a morphological tag (or, simply, a tag). In CAC 2.0, tags follow the PDT design: they are strings of fixed length (15 positions) in which each position corresponds to a single category.
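The positional tag design can be illustrated with a short Python sketch. The position names below are an assumption modeled on the PDT positional tagset and should be checked against the CAC 2.0 documentation; the sample tag and the `decode_tag` helper are hypothetical and are not part of the corpus tools.

```python
# Assumed position names, modeled on the PDT positional tagset;
# verify against the CAC 2.0 documentation before relying on them.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a 15-character positional tag to {category: value}.

    Positions holding '-' (category not applicable) are skipped.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


# Illustrative tag only: values are read off one position at a time.
print(decode_tag("NNMS1-----A----"))
```

Because every category has a fixed position, a tag can be decoded with a simple positional lookup and no further parsing.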

In addition to CAC 2.0, the following PDT resources are available from LDC: Prague Dependency Treebank 1.0, LDC2001T10 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10>, Prague Dependency Treebank 2.0, LDC2006T01 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01>, Prague Arabic Dependency Treebank 1.0, LDC2004T23 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T23> and Prague Czech-English Dependency Treebank 1.0, LDC2004T25 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T25>.


(2) The New York Times Annotated Corpus <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19> contains over 1.8 million articles written and published by the New York Times, with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus also provides associated Java software tools for parsing corpus documents from .xml into a memory-resident object. This rich archive will be useful for a number of linguistics-related research applications, including the development of automatic document summarization systems and automatic content extraction technology.

Highlights of the corpus include:

* Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.

* Over 650,000 article summaries written by library scientists.

* Over 1.5 million articles manually tagged by library scientists, with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.

* Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com.

* Java tools for parsing corpus documents from .xml into a memory-resident object.

The corpus text is formatted in News Industry Text Format (NITF), an XML specification that provides a standardized representation for the content and structure of discrete news articles. NITF includes structural markup such as bylines, headlines and paragraphs. The format also provides management attributes for categorizing articles into topics, summarization usage restrictions and revision histories.
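As a rough illustration of the structural markup, the Python sketch below reads a headline, byline and paragraphs from an NITF-style document using only the standard library. The sample document is hypothetical and heavily simplified, the `parse_article` helper is an invented name, and this is not the Java tooling distributed with the corpus; actual corpus files carry far more metadata and may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NITF-style document for illustration.
SAMPLE = """<nitf>
  <head>
    <title>Example Headline</title>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <byline>By A. Reporter</byline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Extract headline, byline and paragraph text from an NITF-style string."""
    root = ET.fromstring(xml_text)
    headline = root.findtext(".//hl1", default="")
    byline = root.findtext(".//byline", default="")
    # Collect every <p> element in document order.
    paragraphs = [p.text or "" for p in root.iter("p")]
    return {
        "headline": headline.strip(),
        "byline": byline.strip(),
        "paragraphs": paragraphs,
    }


article = parse_article(SAMPLE)
print(article["headline"], len(article["paragraphs"]))
```

Searching by element name rather than by a fixed path keeps the sketch tolerant of optional wrapper elements, at the cost of ignoring where in the document each element appears.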

The New York Times has established a community website for researchers working on the data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion about the corpus.


Ilya Ahtaridis
Membership Coordinator

Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu
