[Corpora-List] POS-tagger maintenance and improvement

Chris Dyer redpony at umd.edu
Wed Feb 25 13:09:08 CET 2009


I think Adam brings up an interesting point. It is certainly the case that the corpora/NLP community, unlike the software and free-encyclopedia communities, has failed to benefit from the "bazaar" (bizarre?) model of open collaboration that has produced such successes as Linux and Wikipedia. This may be an unavoidable situation for a variety of reasons--for example, most useful corpora contain copyrighted material, and most NLP software is generated as a research effort. But I do wonder if a grassroots effort (say, we propose a model that would enable incremental improvements to corpora, models, and software) might be able to convince LDC, for example, to consider hosting a facility for enabling community updates to widely used resources.

Chris

On Wed, Feb 25, 2009 at 11:15 AM, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> All,
>
> My lexicography colleagues and I use POS-tagged corpora all the time, every
> day, and very frequently spot systematic errors. (This is for a range of
> languages, but particularly English.) We would dearly like to be in a
> dialogue with the developers of the POS-tagger and/or the relevant language
> models so the tagger+model could be improved in response to our
> feedback. (We have been using standard models rather than training our
> own.) However, for the taggers and language models we use (mainly
> TreeTagger, also CLAWS), and for the other market leaders, all of which seem
> to come from universities, the developers appear to have little motivation
> to keep improving their taggers, since incremental improvements do not make
> for good research papers. So there is nowhere for our feedback to go, nor
> any real prospect of these taggers/models improving.
>
> Am I too pessimistic? Are there ways of improving language models other
> than developing bigger and better training corpora - not an exercise we have
> the resources to invest in? Are there commercial taggers I should be
> considering (as, in the commercial world, there is motivation for
> incremental improvements and responding to customer feedback)?
> Responses and ideas most welcome.
>
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================