[Corpora-List] Universal POS Tagset

Christian Chiarcos christian.chiarcos at web.de
Wed Feb 4 14:06:10 CET 2009


Hi Adam,

There are numerous approaches involving tag set translation, tag set "interlinguas" or tag sets covering multiple languages, yet to my knowledge, all of these are restricted to a limited set of languages or a specific region.

Historically, the most important approach in this direction is probably the EAGLES Recommendations for the Morphosyntactic Annotation of Corpora (http://www.ilc.cnr.it/EAGLES/annotate/annotate.html), which aim to provide a pan-European tag set; the Multext-East standard extended these to Eastern European languages (http://nl.ijs.si/ME). However, it is highly questionable whether such standardization approaches can be extended far beyond a restricted region and a specific language family (cf. Khoja et al. 2001 on the development of an Arabic tagset independently of EAGLES).

So, I'm afraid to tell you that what you're looking for might not exist.

Still, natural candidates for cross-linguistically applicable sets of annotation values for POS annotation are the Data Category Registry (http://www.isocat.org/) or the General Ontology of Linguistic Description (http://linguistics-ontology.org/).

Yet, these do not represent tag sets in a strict sense, but general inventories of annotation terminology. The main difference is that annotation values in a tag set are mutually exclusive, whereas the different levels of description that influence the design of a POS tag set (syntax, semantics, morphology, lexical ambiguity, ...) may overlap. So, attributive possessive pronouns ("*her* child") are pronouns on semantic and morphological grounds, but syntactically determiners. The ordinal number in "I'm the first." is semantically a number, but syntactically a nominal (head of an NP), etc. A terminological repository may allow for such conceptual overlap, but a tag set needs to resolve these conflicts to justify the assignment of a specific tag, and tag sets adopt different strategies and preferences to cope with such conflicts, e.g., to tag only cardinal numbers as numerals and ordinal numbers as adjectives, or to use the tag for determiner only if the determiner is not a possessive pronoun, etc. These were just two examples from English as a familiar and well-understood language. Such conceptual mismatches increase substantially with the number of languages involved, and what exact selection strategy lies behind a specific tag is basically arbitrary.
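The contrast can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the level names, the concept labels and the two "tag sets" (which differ only in the order in which they consult the levels) are hypothetical and not taken from EAGLES, GOLD or any other real resource.

```python
# Hypothetical illustration: a terminological inventory may describe one
# token on several levels at once, while a tag set must commit to one tag.

# Overlapping descriptions for the two English examples:
# "her" as in "her child", "first" as in "I'm the first."
DESCRIPTIONS = {
    "her": {
        "semantics": "pronoun",
        "morphology": "pronoun",
        "syntax": "determiner",
    },
    "first": {
        "semantics": "numeral",
        "syntax": "nominal",
    },
}


def resolve(description, level_priority):
    """Pick the single tag from the first level in level_priority that applies."""
    for level in level_priority:
        if level in description:
            return description[level]
    raise ValueError("no level in the priority list applies")


def tag_all(level_priority):
    """Tag every example token using one (assumed) design preference."""
    return {
        token: resolve(description, level_priority)
        for token, description in DESCRIPTIONS.items()
    }


# Two hypothetical tag sets that differ only in their design preference.
syntax_first = tag_all(["syntax", "semantics", "morphology"])
semantics_first = tag_all(["semantics", "morphology", "syntax"])

print(syntax_first)
print(semantics_first)
```

Both results are defensible, and neither tag by itself tells you which preference produced it.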

For morphological categories, there is a similar problem: reference terminologies may tell you that there are labels for morphological case such as prepositional or locative, but they don't really tell you whether or not these labels refer to identical or distinct cases in one language or the other. Considering Russian, the prepositional case is occasionally referred to as locative -- this is actually only partly correct, as there are non-locative uses of the prepositional case (http://en.wikipedia.org/wiki/Locative_case#Slavic_languages). So, if you're going to investigate the distribution of locative case marking throughout the world, you may find that some Slavic languages have a locative (in their tag set), but others don't (because it is referred to as prepositional in these tag sets). What you're evaluating is, in the end, just a design decision of some tag set designer.
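A small Python sketch of the locative example, again under stated assumptions: the two tag inventories and the label-to-concept mapping are invented for illustration and do not reproduce any real tag set.

```python
# Hypothetical case inventories of two tag sets for related languages.
CASE_TAGS = {
    "tagset_a": {"nominative", "genitive", "locative"},
    "tagset_b": {"nominative", "genitive", "prepositional"},
}

# Assumed hand-made mapping from tagset-specific labels to a shared concept.
# Note that this mapping is itself an analytical decision: equating
# "prepositional" with "locative" is only partly correct for Russian.
LABEL_TO_CONCEPT = {
    ("tagset_b", "prepositional"): "locative",
}


def has_case_by_label(tagset, label):
    """Naive check: does the tag set contain a tag with this label?"""
    return label in CASE_TAGS[tagset]


def has_case_by_concept(tagset, concept):
    """Check after mapping each label to the (assumed) shared concept."""
    return any(
        LABEL_TO_CONCEPT.get((tagset, tag), tag) == concept
        for tag in CASE_TAGS[tagset]
    )


for name in sorted(CASE_TAGS):
    print(name, has_case_by_label(name, "locative"), has_case_by_concept(name, "locative"))
```

Counting by label finds a locative in only one of the two inventories; counting by concept finds it in both, but only because of the mapping that somebody chose to write down.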

As for a more extreme example, consider the existence of a "verbal participle" in Inuktitut (http://www2.tu-berlin.de/fak1/el/board.cgi?id=angli&action=download&gul=124). It sounds like a participle as we know it, and it would probably be tagged as such in a language-specific tagset, because it is an established term. However, as opposed to Indo-European participles, this is a finite verb (only that, by chance, it is systematically translated by an English progressive participle): the verbal participle is merely a specific mood of the verb indicating the temporal parallelism of multiple events, with normal verbal inflection. So, in the end, what specific conclusions can you draw from the existence of a tag "participle" there?

To make a long story short, there is no universal POS tag set, and the right questions would have to be "Can there be a universal POS tagset at all?" and "If I apply it to my data, how much noise am I willing to accept?"

As you may guess, I have substantial doubts, not only because of the limited expressivity of tag sets (basically 1:1 matches: one tag = one language-specific category = one universal category = one phenomenon?), but also because of the multitude of terminological traditions and linguistic disciplines involved (ranging from typology to NLP). Actually, a closer comparison between EAGLES (with a primary focus on NLP) and GOLD (with a primary focus on typology and language documentation) reveals quite a number of systematic mismatches in the conceptualization, e.g., in the subcategorization of nouns or pronouns/determiners/quantifiers. So, it seems that in its current state neither of these can be regarded as a terminological reference for cross-linguistic, cross-discipline linguistic annotation. However, GOLD is intended to be a community project, and so is the Data Category Registry, and possibly these efforts will one day converge into a general repository of annotation terminology usable by all linguists working with linguistic annotations.

But even then, they will not represent a tag set in a proper sense, for the reasons given above. The question then remains how to bridge the gap between such a general repository of annotation terminology (potentially overlapping, general concepts) and concrete tag sets (mutually disjoint, language- or tagset-specific tags). I do have a suggestion for this, but this certainly belongs to an independent thread ...

Best,

Christian

--
Christian Chiarcos
Universität Potsdam
Collaborative Research Center 632, Project D1 "Linguistic Data Base for Information Structure"
Co-project "Sustainability of linguistic data"
snail: Karl-Liebknecht-Str. 24-25, D-14476 Potsdam-Golm
office: II.24.2.68
email: chiarcos at uni-potsdam.de
web: http://www.sfb632.uni-potsdam.de/~chiarcos
tel.: +49-(0)331/977-2664
fax: +49-(0)331/977-2925


