[Corpora-List] State-of-the-art POS tagging results: A summary

Hrafn Loftsson hrafn at ru.is
Thu Nov 27 15:38:57 CET 2008

Hello all.

I was asked to post a summary regarding the following question I posed about 2 weeks ago:

"Can anyone point me to papers presenting state-of-the-art POS tagging results for some morphologically complex languages?

In his paper "Morphological Tagging: Data vs. Dictionaries" (2000), Jan Hajic presents an evaluation for Czech, Estonian, Hungarian, Romanian, and Slovene, but I wonder if you know of more recent work."

Thanks to all who responded. Here is an extract from the responses:

------------------------------------------------------------------------
Italian is certainly a morphologically rich language, but I do not know if it is complex enough (in the sense you are interested in)...

In any case, last year we set up an evaluation campaign for NLP tools devoted to Italian, and one of the tasks was POS tagging.

You can find all the evaluation results on the EVALITA 2007 web site: http://evalita.fbk.eu/2007/
------------------------------------------------------------------------

Hebrew and Arabic may count as "morphologically complex languages".

For Hebrew, have a look at:

Roy Bar-Haim, Khalil Sima'an and Yoad Winter. Part-of-Speech Tagging of Modern Hebrew Text. In Journal of Natural Language Engineering (J-NLE) <http://www.cambridge.org/journals/journal_catalogue.asp?mnemonic=nle>, 14(2):223-251, 2008.

The work was extended to Arabic:

Saib Mansour, Khalil Sima'an and Yoad Winter. Smoothing a Lexicon-based POS tagger for Arabic and Hebrew. In Proceedings of the ACL 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources. Prague, Czech Republic, 2007.
------------------------------------------------------------------------

Here's a paper from three years ago that shows results for Arabic, Korean, and Czech; it does segmentation and tagging within one model.

Context-Based Morphological Disambiguation with Random Fields
Noah A. Smith, David A. Smith, and Roy W. Tromble
In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 475-482, Vancouver, BC, October 2005.
------------------------------------------------------------------------

You might want to have a look at the COLING paper by Florian Laws and myself this year, in which we presented a POS tagger for fine-grained tagsets and evaluated it on German as well as Czech data.
http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/COLING08/Schmid-Laws.pdf
------------------------------------------------------------------------

There are some more recent results for Estonian.

There is a paper on statistical tagging of Estonian, by Heiki-Jaan Kaalep, Tarmo Vaino. "Complete Morphological Analysis in the Linguist’s Toolbox." Congressus Nonus Internationalis Fenno-Ugristarum Pars V, pp. 9-16, Tartu 2001. http://www.cl.ut.ee/yllitised/smugri_toolbox_2001.pdf

There are several papers on rule-based tagging of Estonian:

Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo. A New Language for Constraint Grammar: Estonian. International Conference Recent Advances in Natural Language Processing. Proceedings. Borovets, Bulgaria, 2003, pp. 304-310. http://math.ut.ee/~kaili/papers/ranlp03.pdf

Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen. Adpositions in Estonian Computational Syntax. Proceedings of the Second ACL-SIGSEM Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications. University of Essex, 19-21 April 2005. Colchester, UK. pp. 2-9. http://www.cs.ut.ee/~kaili/papers/muischneketal.pdf

Kaili Müürisep, Heli Uibo. Shallow Parsing of Spoken Estonian Using Constraint Grammar. Treebanking for Discourse and Speech. Proceedings of NODALIDA 2005 Special Session on Treebanks for Spoken Language and Discourse (ed. Peter Juel Henrichsen and Peter Rossen Skadhauge); Copenhagen Studies in Language 32. Samfundslitteratur. 2006. pp. 105-118. http://www.cs.ut.ee/~kaili/papers/myyruiboLatex.pdf
------------------------------------------------------------------------

We did a similar study for Russian recently: http://corpus.leeds.ac.uk/mocky/

There are also more references in the LREC paper available from the same page.
------------------------------------------------------------------------

Please also check the results from the CADIM group at Columbia on morphological disambiguation (POS tagging) for Arabic:

Roth, Ryan, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of Association for Computational Linguistics (ACL), Columbus, Ohio. 2008.

Diab, Mona. Towards an optimal POS tag set for Modern Standard Arabic Processing. Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, 2007.

Diab, Mona, Kadri Hacioglu and Daniel Jurafsky. Automated Methods for Processing Arabic Text: From Tokenization to Base Phrase Chunking. Book Chapter. In Arabic Computational Morphology: Knowledge-based and Empirical Methods. Editors Antal van den Bosch and Abdelhadi Soudi. Kluwer/Springer Publications, 2007.

Habash, Nizar and Rambow, Owen, 2007. Arabic Diacritization through Full Morphological Tagging. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2007); Companion Volume, Short Papers.

Habash, Nizar and Owen Rambow. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2005).

Diab, Mona, Kadri Hacioglu and Daniel Jurafsky. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Proceedings of Human Language Technology-North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.
------------------------------------------------------------------------

You can find some recent results on Spanish, Romanian and Polish in:

Grzegorz Chrupała, Georgiana Dinu and Josef van Genabith. 2008. Learning Morphology with Morfette. In Proceedings of LREC 2008. http://www.lrec-conf.org/proceedings/lrec2008/pdf/594_paper.pdf

There are also some further experiments on those languages as well as Welsh, Irish, Czech and Slovene in Chapter 6 of:

Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. PhD dissertation, Dublin City University. http://www.lsv.uni-saarland.de/personalPages/gchrupala/papers/phd.pdf
------------------------------------------------------------------------

Some updates to the Hajič paper you cite can be found at http://ufal.mff.cuni.cz/czech-tagging/. You probably want to look at the work from 2005 onwards.
------------------------------------------------------------------------

--
Regards,
Hrafn Loftsson, Ph.D. - www.ru.is/faculty/hrafn
Assistant Professor
School of Computer Science - www.ru.is/cs
Reykjavik University - www.ru.is

Please note that this e-mail and its attachments are intended for the named addressees only and may contain information that is confidential and privileged. Further information: http://www.ru.is/trunadur
