[Corpora-List] POS-tagger maintenance and improvement

Linas Vepstas linasvepstas at gmail.com
Thu Feb 26 21:48:48 CET 2009

2009/2/25 Chris Dyer <redpony at umd.edu>:
> I think Adam brings up an interesting point.  It is certainly the case
> that the corpora/NLP community, unlike the software community and
> free-encyclopedia communities, has failed to benefit from the "bazaar"
> (bizarre?) model of open collaboration that has produced such
> successes as Linux and Wikipedia.  This may be an unavoidable
> situation for a variety of reasons--for example, most useful corpora
> contain copyrighted material, and most NLP software is generated as a
> research effort.  But, I do wonder if a grassroots effort (say, we
> propose a model that would enable incremental improvements to corpora,
> models, and software) might be able to convince LDC, for example, to
> consider hosting a facility for enabling community updates to widely
> used resources.

Easier said than done. The problem is not so much a lack of interest as a lack of expertise. I maintain the link-grammar parser and have made an assortment of incremental improvements over time. Every now and then, someone shows up, eager to fix something, and promptly discovers that there's a non-trivial learning curve -- i.e. they won't be able to fix anything unless they spend weeks, or months, studying the system, pondering, experimenting, etc.

It's not like Wikipedia, where you can pop in, fix something, and in half an hour you're on your way ... it requires a heavy up-front investment, which in turn implies a long-term commitment (like most software projects) -- so it's not for social butterflies.


BTW, I am *very* interested in automatically learning new disjuncts (link-grammar rules) via corpus statistics -- I think this is an excellent line of research, PhD level, for this parser, or any other NLP system, POS tagger, etc.