[Corpora-List] POS-tagger maintenance and improvement

Eckhard Bick eckhard.bick at mail.dk
Wed Feb 25 13:06:55 CET 2009


This is an interesting observation.

One possible explanation for the lack of response to user feedback is that incremental changes are much harder to make in probabilistic / machine-learned systems than in rule-based ones. If a corpus user identifies systematic errors, this can - in a rule-based parser - be used to remove errors or add rules, or to introduce new lexical sets and categories, while in an ML system it would have to be done by paying somebody to annotate the changes into a treebank, which is, as you say, unlikely.
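
To make the contrast concrete, here is a minimal sketch of how a single piece of user feedback could become a single hand-written correction rule applied on top of existing tagger output. It is an illustration only and does not use the syntax or API of CG, AGFL, TreeTagger or CLAWS: the names Token, Rule, noun_after_determiner and apply_rules are hypothetical, the Penn-style tags (DT, VB, NN) are assumed, and the reported error itself is invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag); Penn-style tags are an assumption
Rule = Callable[[List[Token], int], Optional[str]]


def noun_after_determiner(tokens: List[Token], i: int) -> Optional[str]:
    """Hypothetical rule from user feedback: a word tagged VB directly
    after a determiner (DT) is retagged as a noun (NN)."""
    if i > 0 and tokens[i][1] == "VB" and tokens[i - 1][1] == "DT":
        return "NN"
    return None


# Adding a new rule is one more entry here; no retraining is involved.
RULES: List[Rule] = [noun_after_determiner]


def apply_rules(tokens: List[Token], rules: List[Rule] = RULES) -> List[Token]:
    """Return a corrected copy of the tagged tokens.

    For each position, the first rule that proposes a new tag wins.
    """
    corrected = list(tokens)
    for i in range(len(corrected)):
        for rule in rules:
            new_tag = rule(corrected, i)
            if new_tag is not None:
                corrected[i] = (corrected[i][0], new_tag)
                break
    return corrected


if __name__ == "__main__":
    tagged = [("the", "DT"), ("run", "VB"), ("ended", "VBD")]
    print(apply_rules(tagged))
```

The point of the sketch is only that the change is local and inspectable: one reported error pattern maps to one rule, whereas a trained model would need new annotated data to learn the same correction.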

Though my view is probably biased, I think this might be an example of the side-effects of using trained systems for corpus work rather than rule-based ones (like AGFL or CG, to name a couple).

Best regards, Eckhard Bick

Adam Kilgarriff wrote:
> All,
> My lexicography colleagues and I use POS-tagged corpora all the time,
> every day, and very frequently spot systematic errors. (This is for a
> range of languages, but particularly English.) We would dearly like
> to be in a dialogue with the developers of the POS-tagger and/or the
> relevant language models so the tagger+model could be improved in
> response to our feedback. (We have been using standard models rather
> than training our own.) However it seems, for the taggers and
> language models we use (mainly TreeTagger, also CLAWS) and also for
> other market leaders, all of which seem to be from Universities, the
> developers have little motivation for continuing the improvement of
> their tagger, since incremental improvements do not make for good
> research papers, so there is nowhere for our feedback to go, nor any
> real prospect of these taggers/models improving.
> Am I too pessimistic? Are there ways of improving language models
> other than developing bigger and better training corpora - not an
> exercise we have the resources to invest in? Are there commercial
> taggers I should be considering (as, in the commercial world, there is
> motivation for incremental improvements and responding to customer
> feedback)?
> Responses and ideas most welcome
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================

--
Eckhard Bick, cand.med., dr.phil.
University of Southern Denmark
e-mail: eckhard.bick at mail.dk
web: http://beta.visl.sdu.dk
