[Corpora-List] Universal POS Tagset

Damir C'avar dcavar at indiana.edu
Mon Feb 2 17:48:04 CET 2009

maxwell at umiacs.umd.edu wrote:
>> I've been looking for a POS tagset that is general enough to
>> effectively tag "any" natural language. (I'm looking at Linguistic
>> Typology / Universal Implications so I want to compare POS taggings
>> across many [possibly obscure] languages.) Does anyone know of such a
>> tagset?
> The other effort is the GOLD ontology,
> http://linguistics-ontology.org/gold.html. This ontology has been
> populated by people who know about a very large variety of languages (with
> initial input from a list compiled by SIL). It is not really intended as
> a list of tags (or of tag components), although you could use it that way,
> but rather it is intended as something that a tag list could be defined by
> reference to. For example, it is common in Nahuatl to refer to the
> 'absolutive' form of a noun. This has nothing to do with the ergative/
> absolutive distinction, but it is nevertheless a standard usage among
> Nahuatl (maybe even Uto-Aztecan) linguists. The idea behind Gold is that
> a Nahuatl linguist would continue to use the standard 'absolutive' term/
> tag, but define it in terms of the categories in the Gold ontology.

The GOLD ontology is missing some concepts (features and properties) for some (maybe many) languages, but the process for extending it is somewhat defined. There is e.g. a Google group where issues can be discussed:


Indeed, one good idea would be to get more axioms and concepts into GOLD, to extend its usability for a wider range of scenarios and research questions. The comparisons you mention are exactly what we would like to see, e.g. some sort of typology of languages via individual instantiations of GOLD (for the qualitative comparison, and qualitative cross-dependencies between features), as well as via annotated corpora for quantitative differences and similarities.

We used the GOLD ontology in our morphological parser for Croatian (CroMo), and we looked to some extent at the possibility of mapping it to other common tagsets. Our goal was exactly this: to be able to run qualitative and quantitative similarity measures across languages and corpora via some general tagset (and mappings of other tagsets to it, so that we can use existing corpora).
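To make the quantitative side a bit more concrete, here is a minimal sketch of one possible similarity measure between two corpora once both are annotated with the same general tagset. It is not what CroMo does; cosine similarity over relative tag frequencies is simply one easy choice, and the tag labels and the two toy "corpora" are made up for illustration.

```python
from collections import Counter
from math import sqrt


def tag_distribution(tags):
    """Return the relative frequency of each tag in a sequence of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two tag-frequency dictionaries."""
    shared = set(p) & set(q)
    dot = sum(p[t] * q[t] for t in shared)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Toy data: tag sequences from two hypothetical corpora, already mapped
# to a shared tagset. The labels are placeholders, not verified GOLD names.
corpus_a = ["Noun", "Verb", "Noun", "Adjective"]
corpus_b = ["Noun", "Verb", "Verb", "Adposition"]

sim = cosine_similarity(tag_distribution(corpus_a), tag_distribution(corpus_b))
```

Any other distance over the two distributions could be swapped in; the point is only that the comparison becomes straightforward once both corpora share one tagset.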

Mapping of e.g. MULTEXT (EAST) is somewhat possible (maybe losing, in places, specific properties that GOLD has but MULTEXT does not, etc.).
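A rough sketch of what such a mapping, and the loss that comes with it, could look like: each tag is treated as a set of feature/value pairs, and whatever has no counterpart in the target tagset is reported as lost rather than silently dropped. The feature names, values, and the mapping table below are hypothetical placeholders, not actual GOLD or MULTEXT identifiers.

```python
# Hypothetical mapping table: (source feature, source value) ->
# (target feature, target value). All labels are illustrative only.
FEATURE_MAP = {
    ("pos", "noun"): ("pos", "Noun"),
    ("number", "singular"): ("number", "Singular"),
    ("case", "nominative"): ("case", "Nominative"),
}


def map_tag(source_features, feature_map):
    """Map one tag's features to the target tagset.

    Returns two dictionaries: the features that could be mapped, and the
    source features that have no counterpart in the mapping table.
    """
    mapped, lost = {}, {}
    for feature, value in source_features.items():
        target = feature_map.get((feature, value))
        if target is None:
            lost[feature] = value
        else:
            target_feature, target_value = target
            mapped[target_feature] = target_value
    return mapped, lost


# A made-up source tag with one property the table does not cover.
source_tag = {
    "pos": "noun",
    "number": "singular",
    "case": "nominative",
    "animacy": "animate",
}
mapped, lost = map_tag(source_tag, FEATURE_MAP)
```

Collecting the `lost` dictionaries over a whole corpus would give a quick picture of which properties do not survive the mapping.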

ciao DC
