[Corpora-List] Universal POS Tagset

maxwell at umiacs.umd.edu
Mon Feb 2 16:35:40 CET 2009



> I've been looking for a POS tagset that is general enough to
> effectively tag "any" natural language. (I'm looking at Linguistic
> Typology / Universal Implications so I want to compare POS taggings
> across many [possibly obscure] languages.) Does anyone know of such a
> tagset?

One issue is the level of detail you want in the tags. If it's just the standard parts of speech (noun, verb, pre-/post-position...), it might not be hard to come up with a list, although there would still be problems in particular languages (is the 'for' of English for-to clauses a preposition or a complementizer, and is there really a difference?).

If, on the other hand, you want to tag things like person, number, etc., which plenty of taggers have done, then there is a very long list of features and feature values which one might tag. There are, for example, languages which, in addition to the usual singular/plural distinction in the number feature, distinguish dual, trial, paucal, etc.; and languages which have gender classes far different from those dreamed of in most categorizations. And there are languages which morphologically mark verbs for such things as agreement with ergative and absolutive arguments, and evidential status (seen/inferred/reported, etc.).
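To make the feature/value idea concrete, here is a minimal sketch in Python of a tag built from a part of speech plus feature values. The names (FEATURE_VALUES, Tag, make_tag, render) and the "pos|feature=value" rendering are assumptions made up for illustration, not part of any existing tagset, and the value inventories only list the examples mentioned above.

```python
from dataclasses import dataclass

# Illustrative inventories only; a real cross-linguistic list would be far longer.
FEATURE_VALUES = {
    "number": {"singular", "dual", "trial", "paucal", "plural"},
    "evidentiality": {"seen", "inferred", "reported"},
}


@dataclass(frozen=True)
class Tag:
    pos: str
    features: tuple = ()  # sorted (feature, value) pairs


def make_tag(pos, **features):
    """Build a tag, rejecting features or values missing from the inventory."""
    for name, value in features.items():
        allowed = FEATURE_VALUES.get(name)
        if allowed is None:
            raise ValueError(f"unknown feature: {name}")
        if value not in allowed:
            raise ValueError(f"unknown value for {name}: {value}")
    return Tag(pos, tuple(sorted(features.items())))


def render(tag):
    """Render a tag as 'pos|feature=value|...' (a hypothetical format)."""
    return "|".join([tag.pos] + [f"{k}={v}" for k, v in tag.features])


if __name__ == "__main__":
    print(render(make_tag("noun", number="dual")))
    print(render(make_tag("verb", number="plural", evidentiality="inferred")))
```

The point of the sketch is only that every new distinction (trial, paucal, evidential status...) enlarges the inventory, which is why the list gets long quickly.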

Yet another issue for standardized tag sets is that some morphosyntactic feature values will cover a wider range in one language than they might in another, or values will overlap in different ways in different languages. Case systems are notoriously like that.

I know of two efforts to come up with lists of tags (in addition to the responses you've already gotten). One is the ISO TC 37/SC4 effort for lexicons, which uses a "Data Category Registry" to register tags for use in electronic lexicons; see http://www.isocat.org. The last time I looked, this struck me as rather Euro-centric, meaning that it might not be a good fit for "possibly obscure" languages.

The other effort is the GOLD ontology, http://linguistics-ontology.org/gold.html. This ontology has been populated by people who know about a very large variety of languages (with initial input from a list compiled by SIL). It is not really intended as a list of tags (or of tag components), although you could use it that way; rather, it is intended as a reference point against which a tag list can be defined. For example, it is common in Nahuatl to refer to the 'absolutive' form of a noun. This has nothing to do with the ergative/absolutive distinction, but it is nevertheless a standard usage among Nahuatl (maybe even Uto-Aztecan) linguists. The idea behind GOLD is that a Nahuatl linguist would continue to use the standard 'absolutive' term/tag, but define it in terms of the categories in the GOLD ontology.
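As a rough sketch of what "defining a tag by reference to an ontology" could look like in practice, the Python below maps each language's local tag to a shared concept and compares tags by concept rather than by label. Everything here is hypothetical: the concept identifiers are placeholders and are not actual GOLD identifiers, and "ergative_language" stands in for any language with an ergative/absolutive case system.

```python
# Local tag -> shared concept, per language.
# The concept strings are made-up placeholders, NOT real GOLD identifiers.
LOCAL_TAGSETS = {
    "nahuatl": {"absolutive": "concept:NounFormPlaceholder"},
    "ergative_language": {"absolutive": "concept:AbsolutiveCasePlaceholder"},
}


def resolve(language, tag):
    """Return the shared concept that a language-specific tag is defined by."""
    return LOCAL_TAGSETS[language][tag]


def same_category(lang_a, tag_a, lang_b, tag_b):
    """Compare two local tags by the concept they point to, not by their label."""
    return resolve(lang_a, tag_a) == resolve(lang_b, tag_b)


if __name__ == "__main__":
    # Same label in both traditions, but defined by different concepts.
    print(same_category("nahuatl", "absolutive", "ergative_language", "absolutive"))
```

Each tradition keeps its own label, and a cross-language comparison goes through the shared concepts instead of trusting that identical labels mean the same thing.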

Mike Maxwell

CASL/ U MD


