[Corpora-List] POS-tagger maintenance and improvement

Eric Atwell eric at comp.leeds.ac.uk
Wed Feb 25 14:48:54 CET 2009


Majdi Sawalha here at Leeds has volunteered to investigate how easy it is to train TreeTagger for Arabic; and then, if this works, how he might make use of any feedback you might have on systematic errors. However, I fear this may not be practicable: (i) the TreeTagger model may not work for Arabic, and (ii) the model is corpus-derived and so may not be "tweakable" to deal with systematic errors.

I *think* the underlying TreeTagger model uses a lexicon and suffix list to assign one or more possible PoS-tags to each word, then uses a decision tree (trained on a tagged corpus) to select the best tag compatible with the context. BUT Arabic has complex morphology, and a PoS-tag is a bundle of features derived from a bundle of morphemes: many words will not appear in a corpus-derived lexicon, and the suffix alone will be only a partial clue to the full PoS-tag feature-set. Also, because of the complex morphology, there is a very large number of possible feature combinations, leading to a large PoS-tagset; so even the decision-tree model needs a very large training corpus to avoid training-data sparseness.
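For readers unfamiliar with this style of tagger, the two-stage idea I describe above can be sketched roughly as follows. This is NOT TreeTagger's actual code or data: the LEXICON, SUFFIX_TAGS and CONTEXT_SCORE tables are invented toy examples, and the real system uses a probabilistic decision tree over tag n-grams rather than the simple bigram preference table used here for illustration.

```python
# Illustrative sketch only, not TreeTagger internals.
# Stage 1: a lexicon / suffix list proposes candidate tags per word.
# Stage 2: a context model (here a toy bigram score table standing in
#          for TreeTagger's decision tree) picks among the candidates.

LEXICON = {"the": ["DET"], "dog": ["NN"], "barks": ["VBZ", "NNS"]}
SUFFIX_TAGS = {"s": ["NNS", "VBZ"], "ed": ["VBD"], "ing": ["VBG"]}

# Toy context preferences: (previous tag, candidate tag) -> score.
CONTEXT_SCORE = {("DET", "NN"): 2.0, ("NN", "VBZ"): 2.0, ("NN", "NNS"): 0.5}

def candidates(word):
    """Stage 1: lexicon lookup, falling back to suffix clues."""
    if word in LEXICON:
        return LEXICON[word]
    # Try longer suffixes first, as they are more informative.
    for suffix, tags in sorted(SUFFIX_TAGS.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix):
            return tags
    return ["NN"]  # default guess for unknown words

def tag(sentence):
    """Stage 2: greedily pick the candidate best supported by context."""
    prev, out = "START", []
    for word in sentence:
        best = max(candidates(word),
                   key=lambda t: CONTEXT_SCORE.get((prev, t), 1.0))
        out.append((word, best))
        prev = best
    return out

print(tag(["the", "dog", "barks"]))
# -> [('the', 'DET'), ('dog', 'NN'), ('barks', 'VBZ')]
```

The sketch also makes the Arabic problem concrete: for a morphologically rich language, the candidate sets grow huge (each tag being a whole feature bundle), unknown words are frequent, and a context table trained on a small corpus is too sparse to discriminate among them.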

As others have commented, TreeTagger models for other languages are also derived from a PoS-tagged corpus, which suggests the only way to eradicate systematic errors is to "correct" the tagging in the training corpus, or perhaps to use a different corpus altogether.

Eric Atwell, Leeds University

On Wed, 25 Feb 2009, Adam Kilgarriff wrote:

> All,
> My lexicography colleagues and I use POS-tagged corpora all the time,
> every day, and very frequently spot systematic errors.  (This is for a
> range of languages, but particularly English.)   We would dearly like to
> be in a dialogue with the developers of the POS-tagger and/or the
> relevant language models so the tagger+model could be improved in
> response to our feedback. (We have been using standard models rather than
> training our own.)   However it seems, for the taggers and language
> models we use (mainly TreeTagger, also CLAWS) and also for other market
> leaders, all of which seem to be from Universities, the developers have
> little motivation for continuing the improvement of their tagger, since
> incremental improvements do not make for good research papers, so there
> is nowhere for our feedback to go, nor any real prospect of these
> taggers/models improving.
> Am I too pessimistic?  Are there ways of improving language models other
> than developing bigger and better training corpora - not an exercise we
> have the resources to invest in?  Are there commercial taggers I should
> be considering (as, in the commercial world, there is motivation for
> incremental improvements and responding to customer feedback)?
> Responses and ideas most welcome
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff                                    
>  http://www.kilgarriff.co.uk              
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================

-- Eric Atwell,

Senior Lecturer, Language research group, School of Computing,

Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England

TEL: 0113-3435430 FAX: 0113-3435468 WWW/email: google Eric Atwell
