[Corpora-List] New from LDC

Linguistic Data Consortium ldc
Thu Apr 23 23:31:02 CEST 2009


LDC2009L01 - An English Dictionary of the Tamil Verb, Second Edition <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01>

LDC2009T08 - Japanese Web N-gram Version 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T08>

The Linguistic Data Consortium (LDC) is pleased to announce the availability of two new publications.

------------------------------------------------------------------------

*New Publications*

(1) An English Dictionary of the Tamil Verb, Second Edition <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01> represents over twenty-five years of work led by Harold F. Schiffman, Professor Emeritus of Dravidian Linguistics and Culture at the University of Pennsylvania's Department of South Asia Studies. It contains translations for 6597 English verbs and defines 9716 Tamil verbs. This release presents the dictionary in two formats: Adobe PDF and XML. The PDF format displays the dictionary in a human-readable form. The XML version is a purely electronic form intended mainly for application development and the creation of searchable electronic databases.

In the electronic XML version, each entry contains the following: the English entry or head word; the Tamil equivalent (in Tamil script and transliteration); the verb class and transitivity specification; the spoken Tamil pronunciation (audio files in .mp3 format); the English definition(s); additional Tamil entries (if applicable); example sentences or phrases in Literary Tamil and Spoken Tamil (with a corresponding audio file in .mp3 format), together with an English translation; and Tamil synonyms or near-synonyms, where appropriate. It is expected that the dictionary will be useful for Tamil learners, scholars and others interested in the Tamil language.
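For application developers, the entry contents listed above could be modeled roughly as follows. This is only a sketch: the class names, field names and the audio_files helper are hypothetical and are not taken from the actual XML schema shipped with the release, which should be consulted for the real element names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Example:
    """One example sentence or phrase (field names are hypothetical)."""
    literary_tamil: str
    spoken_tamil: str
    english: str
    audio_file: Optional[str] = None  # .mp3 for the Spoken Tamil form, if any


@dataclass
class Entry:
    """One dictionary entry, mirroring the contents listed in the announcement."""
    english_headword: str
    tamil_script: str
    transliteration: str
    verb_class: str
    transitivity: str
    definitions: List[str]
    pronunciation_audio: Optional[str] = None  # .mp3 of the spoken pronunciation
    additional_tamil_entries: List[str] = field(default_factory=list)
    examples: List[Example] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)


def audio_files(entry: Entry) -> List[str]:
    """Collect every audio file referenced by an entry, pronunciation first."""
    files = []
    if entry.pronunciation_audio:
        files.append(entry.pronunciation_audio)
    files.extend(ex.audio_file for ex in entry.examples if ex.audio_file)
    return files
```

Keeping the optional parts (additional entries, examples, synonyms) as lists that default to empty reflects the "if applicable" and "where appropriate" wording above.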

What's New in the Second Edition?

* Errors in the Tamil text and the roman transliteration have been corrected.

* Audio files have been updated and corrected, and missing files have been added.

* A brand new search and browse application that can access the audio has been included in this edition. This application can be accessed from the tools directory.

* The XML structure has been modified to normalize the presentation of synonyms.

An English Dictionary of the Tamil Verb seeks to meet needs not currently addressed by existing English-Tamil dictionaries. The main goal of this dictionary is to get an English-knowing user to a Tamil verb, irrespective of whether he or she begins with an English verb or some other item, such as an adjective; this is because what is a verb in Tamil may not be a verb in English, and vice versa. In particular, the dictionary concentrates on supplying the kinds of information lacking in all previous attempts to capture the equivalencies between English and Tamil.

(2) Japanese Web N-gram Version 1 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T08> was created by Google Inc. It consists of Japanese "word" n-grams and their observed frequency counts generated from over 255 billion tokens of text. The length of the n-grams ranges from unigrams to seven-grams.

The n-grams were extracted from publicly accessible web pages that were crawled by Google in July 2007. This data set contains only n-grams that appear at least 20 times in the processed sentences. Less frequent n-grams were simply discarded. Those web pages requiring user authentication, pages containing "noarchive" or "noindex" meta tags, and pages under other special restrictions were excluded from the final release. While the aim was to process only Japanese pages, the corpus may contain some pages in other languages due to language detection errors. This dataset will be useful for research in areas such as statistical machine translation, language modeling and speech recognition, among others.

Before the n-grams were collected, the web pages were converted into UTF-8 encoding, normalized into Unicode Normalization Form KC, and split into sentences. Ill-formed sentences were filtered out, and the remaining sentences were segmented into "words". The vocabulary was restricted to "words" that appeared at least 50 times in the processed sentences. Less frequent words were replaced with the "<UNK>" special token.
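The processing steps described above can be illustrated with a small sketch. It is not Google's actual pipeline: the function names are hypothetical, sentence splitting and the filtering of ill-formed sentences are omitted, and the segment function simply splits on whitespace as a stand-in for a real Japanese word segmenter, which the announcement does not identify. The thresholds are parameters; the release used 50 for words and 20 for n-grams, with n up to 7.

```python
import unicodedata
from collections import Counter

UNK = "<UNK>"  # special token for out-of-vocabulary words


def normalize(text):
    """Apply Unicode Normalization Form KC (e.g. full-width 'Ａ' becomes 'A')."""
    return unicodedata.normalize("NFKC", text)


def segment(sentence):
    """Placeholder segmenter: assumes the input is already space-separated."""
    return normalize(sentence).split()


def build_vocab(sentences, min_word_count):
    """Keep only words seen at least min_word_count times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, c in counts.items() if c >= min_word_count}


def count_ngrams(sentences, max_n, min_word_count, min_ngram_count):
    """Count 1..max_n-grams after <UNK> replacement, then drop rare n-grams."""
    vocab = build_vocab(sentences, min_word_count)
    counts = Counter()
    for sent in sentences:
        mapped = [w if w in vocab else UNK for w in sent]
        for n in range(1, max_n + 1):
            for i in range(len(mapped) - n + 1):
                counts[tuple(mapped[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_ngram_count}


if __name__ == "__main__":
    # Toy data with tiny thresholds so the effect of each step is visible.
    corpus = [segment(s) for s in ["ａ b", "a b", "a c"]]
    for gram, c in sorted(count_ngrams(corpus, 2, 2, 2).items()):
        print(" ".join(gram), c)
```

Note that the vocabulary threshold is applied before counting, so a rare word still contributes to n-gram counts through its "<UNK>" replacement; only n-grams that remain below the n-gram threshold afterwards are discarded.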

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium
University of Pennsylvania
3600 Market St., Suite 810
Philadelphia, PA 19104 USA

Phone: (215) 573-1275
Fax: (215) 573-2175
ldc at ldc.upenn.edu
http://www.ldc.upenn.edu
