[Corpora-List] POS-tagger maintenance and improvement
linasvepstas at gmail.com
Thu Feb 26 21:48:48 CET 2009
2009/2/25 Chris Dyer <redpony at umd.edu>:
> I think Adam brings up an interesting point. It is certainly the case
> that the corpora/NLP community, unlike the software community and
> free-encyclopedia communities, has failed to benefit from the "bazaar"
> (bizarre?) model of open collaboration that has produced such
> successes as Linux and Wikipedia. This may be an unavoidable
> situation for a variety of reasons--for example, most useful corpora
> contain copyrighted material, and most NLP software is generated as a
> research effort. But, I do wonder if a grassroots effort (say, we
> propose a model that would enable incremental improvements to corpora,
> models, and software) might be able to convince LDC, for example, to
> consider hosting a facility for enabling community updates to widely
> used resources.
Easier said than done. The problem is not so much a lack
of interest as a lack of expertise. I maintain the
link-grammar parser and have made an assortment of incremental
improvements over time. Every now and then,
someone shows up, eager to fix something, and
promptly discovers that there's a non-trivial
learning curve -- i.e. they won't be able to fix
anything unless they spend weeks, or months,
studying the system, pondering, experimenting, etc.
It's not like Wikipedia, where you can pop in, fix
something and in half an hour you're on your way ...
it requires a heavy up-front investment, which
in turn implies a long-term commitment (like most
software projects) -- so it's not for social butterflies.
BTW, I am *very* interested in automatically learning
new disjuncts (link-grammar rules) via corpus statistics
-- I think this is an excellent line of research, PhD level,
for this parser, or any other NLP system, POS tagger, etc.