[Corpora-List] POS-tagger maintenance and improvement

Francis Tyers ftyers at prompsit.com
Thu Feb 26 00:12:52 CET 2009

On Wed, 25-02-2009 at 22:53 +0000, Andras Kornai wrote:
> Serge,
> I can't speak for the others but certainly hunpos/hunmorph/hunspell
> and the other hun* tools are very open to user contributions, be they
> algorithmic, lexical, or just bug reports of any sort. It is the very
> nature of trainable tools that they take on the error pattern of the
> training corpora, and we have seen many reports of people hand-correcting
> the training data they are working with, for example Mikheev (2002) writes
> "We found quite a few infelicities in the original [WSJ corpus]
> tokenization and tagging, however, which we had to correct by hand"
> and we have the same experience with most corpora we use, including
> our own. Creating some kind of clearinghouse or feedback mechanism for
> manual corrections, clever postprocessing hacks etc. would certainly
> have value, as long as these contributions don't carry restrictive
> licensing. There is a minefield here: the SVMTools and the hun* tools
> are LGPL (meaning that industry is welcome to participate) while the
> Stanford tools are GPL, which explicitly forbids incorporation in
> proprietary software. So if you want to send corrections make sure
> they are LGPL.
> Andras Kornai
> PS. Historically, the NLP community used a "give credit but otherwise
> do what you will" license, and the habit of sharing critical material
> (e.g. Henry Spencer's freely redistributable regex(3) or Jorge
> Stolfi's original set of dictionaries) predates the Free Software
> movement. Originally, the emphasis was very much on making sure
> nothing proprietary creeps in, so when the FSF tried to fork ispell
> (the precursor of hunspell) this was very strongly resisted by the
> creators who saw it as an obstacle to truly free use. I personally
> believe that part of the reason why, in Chris Dyer's words,
> "the corpora/NLP community, unlike the software community and
> free-encyclopedia communities, has failed to benefit from the "bazaar"
> (bizarre?) model of open collaboration"

> is that the GPL basically stands in the way of industry-academia
> partnerships, FSF claims to the contrary notwithstanding.

(Insert BSD vs. GPL flame war here)

There are many counterexamples to this, e.g. the previously mentioned GrammarSoft, whose VISLCG is GPL and which has disambiguation grammars available under a range of licences. There are also plenty of companies which make a living using and providing services for GPL software.

The problem, at any rate, is not with code: there are probably hundreds of POS taggers out there under a wide variety of licences. The problem is with data.

You can train a free part-of-speech tagger on a proprietary corpus, or you can train a proprietary part-of-speech tagger on a free corpus... or you could if free corpora existed -- creating POS-tagged corpora for a range of languages using either Wikipedia (for you GFDL / CC-BY-SA fans) or Gutenberg (for the public domain / BSD minded) would be a great place to start.


PS. One of the things that we've done is decide to use _free_ text for performing evaluations. So if you want to, e.g., evaluate your MT system using post-editing, instead of taking news text from whichever newspaper, take the text from Wikipedia. Then you can translate, post-edit and distribute the resulting parallel aligned corpus free for others to use.
