[Corpora-List] POS-tagger maintenance and improvement

Andras Kornai andras at kornai.com
Thu Feb 26 01:17:51 CET 2009

On Thu, Feb 26, 2009 at 12:12:52AM +0100, Francis Tyers wrote:
> > is that the GPL basically stands in the way of industry-academia
> > partnerships, FSF claims to the contrary notwithstanding.
> (Insert BSD vs. GPL flame war here)


> There are many counter examples to this, e.g. the previously mentioned
> GrammarSoft, whose VISLCG is GPL and which has disambiguation grammars
> available under a range of licences. There are also plenty of companies
> which make a living using and providing services for GPL software.

This last sentence is what I call "FSF claims to the contrary". Yes, you can build a small consulting business on GPL software, or if you are IBM you may even be able to build a large consulting business that way. (Note that Red Hat and other GPL champions are tiny dots on the map of software -- the entire market capitalization of RH, at $2.7 billion, is comparable to the annual income of SAS, at $2.26 billion.)

Let us grant the point that one can build a GPL business. However, our typical users are telcos, ISPs, and other companies whose primary business is not software, let alone software consulting, and they are totally opposed to the idea of opening up their codebase (in part for security-through-obscurity reasons, the subject of another worthy flamewar). Perhaps they are wrong-headed and should open up. However, we see absolutely no reason to fight this war: our business is with NLP, not with free software evangelism.

> The problem at any rate is not with code, there are probably hundreds of
> POS taggers out there under a wide variety of licences. The problem is
> with data.
> You can train a free part-of-speech tagger on a proprietary corpus, or
> you can train a proprietary part-of-speech tagger on a free corpus... or
> you could if they existed -- creating POS tagged corpora for a range of
> languages using either Wikipedia (for you GFDL / CC-BY-SA fans) or
> Gutenberg (for the public domain / BSD minded) would be a great place to
> start.

The data problem, even for copyrighted data, is far smaller for NLP than it is usually made out to be. We at the Budapest Institute of Technology assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01) which is chock-full of copyrighted text, by the simple expedient of merging all text files into one, alphabetically sorting the sentences, and leaving a portion out, which makes it more labor-intensive to restore the original documents than to key them in. This method (blessed by the UPenn lawyers) destroys the value of the corpus for discourse analysis or convergence studies like Curran and Osborne 2002, but 95% of what we as computational linguists do is at or below the sentence level.
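
For concreteness, the merge/sort/leave-out procedure described above could be sketched roughly as follows. This is only an illustration, not the script used for Hunglish: the function name `scramble_corpus`, the `drop_fraction` and `seed` parameters, the random choice of which units to leave out, and the toy sentence pairs are all assumptions made for the example.

```python
import random


def scramble_corpus(documents, drop_fraction=0.1, seed=0):
    """Merge documents, leave out a random portion of units, sort the rest.

    `documents` is a list of documents, each a list of units. A unit can be
    a sentence string or an aligned (source, target) tuple. How the omitted
    portion is chosen is an assumption here: each unit is dropped
    independently with probability `drop_fraction`.
    """
    rng = random.Random(seed)
    # Merge: document boundaries are discarded at this point.
    merged = [unit for doc in documents for unit in doc]
    # Leave a portion out, so the originals cannot be fully restored.
    kept = [unit for unit in merged if rng.random() >= drop_fraction]
    # Sort: Python's default ordering (by code point), which is not a
    # locale-aware Hungarian collation but destroys the original order.
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical toy documents made of aligned Hungarian-English pairs.
    docs = [
        [("Esik az eső.", "It is raining."), ("A kutya ugat.", "The dog barks.")],
        [("Jó reggelt.", "Good morning."), ("A ház nagy.", "The house is big.")],
    ]
    for hu, en in scramble_corpus(docs, drop_fraction=0.25):
        print(hu, "|||", en)
```

Each aligned pair stays intact, so sentence-level work (tagging, alignment, MT training) is unaffected, while the order of sentences within and across documents is gone.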

I would like the main thrust of what I said not to be lost in the noise of the flamewar: some kind of clearinghouse for corrected data would be useful. I didn't offer to set one up because I'm not sure the Budapest Institute of Technology has the resources (the bottleneck is not server space but the effort it takes to curate the data, and wikis are not great for this), but we'd be happy to contribute.

Andras Kornai

> Fran
> PS. One of the things that we've done is decide to use _free_ text for
> performing evaluations. So if you want to e.g. evaluate your MT system
> using post-edition, instead of taking news text from whichever
> newspaper, take the text from Wikipedia, then you can translate,
> post-edit and distribute the resulting parallel aligned corpus free for
> others to use.
