[Corpora-List] annotation tags

Amir Zeldes Amir.Zeldes at georgetown.edu
Tue Dec 15 16:33:05 CET 2020


Hi Hugh,

The POS tags and morphological category values used in the Universal Dependencies project are becoming popular across a range of languages, including both high- and low-resource languages. You can read more about the inventories here:

Universal (coarse) POS tags: https://universaldependencies.org/u/pos/index.html

Morphological features: https://universaldependencies.org/u/feat/index.html

Syntactic function labels: https://universaldependencies.org/u/dep/index.html

For morphological categories you may also want to check out UniMorph:

https://unimorph.github.io/
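To make the three Universal Dependencies inventories above a bit more concrete, here is a minimal Python sketch of how they typically appear together on one token line in the CoNLL-U format used by UD treebanks. The example token line and the helper names (`parse_feats`, `parse_token_line`) are illustrative assumptions, not taken from any particular treebank or library.

```python
# Minimal sketch: read one CoNLL-U token line and pull out the UPOS tag,
# the morphological features, and the syntactic function label.
# The ten column names follow the CoNLL-U format; the example line below
# is made up for illustration.

CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_feats(feats):
    """Turn 'Number=Plur|Case=Nom' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_token_line(line):
    """Split a tab-separated CoNLL-U token line into a dict of fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 tab-separated columns")
    token = dict(zip(CONLLU_FIELDS, values))
    token["feats"] = parse_feats(token["feats"])
    return token


# Hypothetical token line: the plural noun 'dogs' as subject of token 3.
example = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"
token = parse_token_line(example)
print(token["upos"], token["feats"], token["deprel"])
```

Here `NOUN` comes from the coarse POS inventory, `Number=Plur` from the morphological features, and `nsubj` from the syntactic function labels.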

Best,

Amir

------------

Dr. Amir Zeldes

Assoc. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057

https://corpling.uis.georgetown.edu/amir

From: corpora-bounces at uib.no <corpora-bounces at uib.no> On Behalf Of Hugh Paterson III

Sent: Monday, December 7, 2020 4:06 PM

To: corpora at uib.no

Subject: [Corpora-List] annotation tags

Greetings,

Can anyone point me to a set of annotation tags which are commonly used across corpora projects?

In the area of Field Linguistics, grammar writing is the process of publishing a description of how a natural language functions, usually as a book. Within this practice of publication it is common to give examples as interlinear glosses, which may be word- or morpheme-aligned. Over the last 15 or so years there has been an ad hoc effort to standardize the tags used to describe morphemes in these interlinear glosses. This effort has been influenced by the Leipzig Glossing Rules (LGR), which provided a suggested list of abbreviations based on some prior art.

I have noticed that some of these abbreviations have now surfaced in annotated corpora within the domain of Language Documentation, which frequently uses a tool called ELAN to annotate audio/video texts in under-resourced languages.

So, within Language Documentation and Field Linguistics one can see the influence of LGR in the types of annotation tags chosen in a corpus. HOWEVER, I am wondering if there is perhaps a different influence for the types of values one might see in corpora of more-resourced languages. That is, is there any continuity in the practice of corpora annotation regardless of the sub-field within linguistics where the corpus might originate?

Can anyone point me to a set of annotation tags which are commonly used across corpora projects?

all the best,

- Hugh Paterson III



