[Corpora-List] Tag sets

Nicolas Torzec torzecn at yahoo-inc.com
Wed Feb 6 10:00:13 CET 2008


Hi. Having worked on a similar project a few years ago, I think the following references could be of interest for your project.

1) TEI: Text Encoding Initiative

The Text Encoding Initiative (TEI) is a consortium of institutions and research projects which collectively maintains and develops a standard for the representation of texts in digital form. Its major deliverable is a set of Guidelines, which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. The Guidelines define some 400 different textual components and concepts, which can be expressed using a markup language and defined by a DTD or XML schema. => http://www.tei-c.org/index.xml

2) NSW: Normalization of Non-Standard Words

@misc{ sproat-article,
  author = "Richard Sproat and Alan W Black and Stanley Chen and Shankar Kumar and Mari Ostendorf and Christopher Richards",
  title = "Article Submitted to Computer Speech and Language",
  url = "citeseer.ist.psu.edu/537653.html"
}

=> http://www.clsp.jhu.edu/ws99/projects/normal/
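As a rough illustration of the distinctions raised in the quoted message below (numerals such as '3' or '4.5', acronyms, and all other non-dictionary words), a first pass could be rule-based. This is only a minimal sketch: the tag names (DICT, NUM, ACR, NOT) and the functions tag_token and tag_tokens are hypothetical, not taken from the NSW work or from any standard tagset, and the lexicon is assumed to be a plain set of lowercased citation forms. The acronym rule relies on Latin upper case, so it would not carry over to Bengali script as written; the numeral rule uses \d, which in Python 3 string patterns also matches non-Latin decimal digits.

```python
import re

# Hypothetical tag names, for illustration only.
DICT = "DICT"      # token found in the lexicon
NUMERAL = "NUM"    # digit strings such as "3" or "4.5"
ACRONYM = "ACR"    # Latin-script acronyms such as "BBC" or "U.S."
OTHER = "NOT"      # any other non-dictionary word (e.g. most proper names)

# Digits, optionally grouped or split by '.' or ','.
_NUMERAL_RE = re.compile(r"^\d+(?:[.,]\d+)*$")
# Two or more capital letters, each optionally followed by a period.
_ACRONYM_RE = re.compile(r"^(?:[A-Z]\.?){2,}$")


def tag_token(token, lexicon):
    """Return a coarse tag for one token, given a set of lowercased forms."""
    if token.lower() in lexicon:
        return DICT
    if _NUMERAL_RE.match(token):
        return NUMERAL
    if _ACRONYM_RE.match(token):
        return ACRONYM
    return OTHER


def tag_tokens(tokens, lexicon):
    """Tag each token independently and return (token, tag) pairs."""
    return [(t, tag_token(t, lexicon)) for t in tokens]
```

With a toy lexicon containing "river" and "three", the tokens "Mississippi", "River", "3", "three" and "BBC" would come out as NOT, DICT, NUM, DICT and ACR respectively, which matches the "Mississippi River" example in the quoted message. A real setup would replace the lexicon lookup with the morphological parser's own dictionary check.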

Hope this helps. Nicolas

-- Nicolas Torzec Yahoo! Inc.

maxwell at umiacs.umd.edu wrote:
> We're looking at annotating a small sample (~5k words) of Bengali text,
> and later maybe Urdu and Punjabi. The annotation will be the dictionary
> citation form of each word. The texts are mostly news articles, so there
> are a fair number of words for which there won't be any dictionary
> citation form. These include many proper names, numerals, acronyms, and
> who knows what else. I'll refer to these as "non-dictionary words",
> whereas "dictionary words" will include words whose citation form is in
> the dictionary we're using, even if the inflected wordform itself is not.
> (We're doing this to test a morphological parser.)
>
> This is not quite the same as the inverse of named entity tagging, since
> some parts of names may have citation forms. For example, in English one
> would tag "Mississippi River" as a name. But "River" would be found in
> the dictionary, so for our purposes we would only want to tag
> "Mississippi" as a non-dictionary word.
>
> The simplest thing for us to do would be to just tag all such
> non-dictionary words the same way, e.g. with a tag "NOT". However, in the
> interest of future uses to which we might put such a tagged text, it might
> be good to differentiate among the various kinds of non-dictionary words.
>
> We could easily make up our own tagset for non-dictionary words, but it
> strikes me that better would be to use some standard tagset for such
> words, if such a tagset exists. There is a table of tagsets in Manning
> and Schutze pg. 141-2, including the Penn Treebank, Brown, and CLAWS.
> However, the tagsets are English-specific. This is especially noticeable
> in the punctuation tags for the PTB and Brown sets, but also e.g. in the
> decision to tag singular and plural proper nouns differently. (Some
> languages attach case markers to proper nouns.) Also, it appears that
> none of the tagsets distinguishes between numerals ('3', '4.5') and
> numbers written out ('three', 'four point five'), which we need to do, nor
> are acronyms distinguished from "symbols".
>
> Another distinction I thought about making is between "ordinary" Bengali
> names, and foreign names, since one might later want to develop a
> transducer to convert the latter into their more common Latin forms.
> However, I suspect that might be too difficult a distinction for
> annotators to make, and in any case some well-known Bengali names are
> likely to have "standard" transliterations.
>
> Does anyone know of a semi-standard tagset that would be less
> English-specific, and would make the kinds of distinctions among
> non-dictionary words that we want to (or should) make? Or should we just
> make up our own set?
>
> Mike Maxwell
> CASL/ U MD
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>


