[Corpora-List] POS-tagger maintenance and improvement

Francis Tyers ftyers at prompsit.com
Thu Feb 26 08:29:53 CET 2009


On Thu, 26-02-2009 at 00:17 +0000, Andras Kornai wrote:
> On Thu, Feb 26, 2009 at 12:12:52AM +0100, Francis Tyers wrote:
>
> > The problem at any rate is not with code, there are probably hundreds of
> > POS taggers out there under a wide variety of licences. The problem is
> > with data.
> >
> > You can train a free part-of-speech tagger on a proprietary corpus, or
> > you can train a proprietary part-of-speech tagger on a free corpus... or
> > you could if they existed -- creating POS tagged corpora for a range of
> > languages using either Wikipedia (for you GFDL / CC-BY-SA fans) or
> > Gutenberg (for the public domain / BSD minded) would be a great place to
> > start.
>
> The data problem, even for copyrighted data, is far less for NLP than it is
> usually made out to be. We at the Budapest Institute of Technology
> assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01)
> which is chock-full of copyrighted text, by the simple expedient of merging
> all text files in one and alphabetically sorting the sentences leaving a
> portion out, which makes it more labor-intensive to restore the original
> documents than keying them in. This method (blessed by the UPenn lawyers)

Actually, I was thinking of doing something similar, but was led to believe that the text would still be copyrighted even after sorting, and thus couldn't be distributed under a free licence such as the BSD or LGPL.

Do you have by any chance a written statement from the UPenn lawyers regarding this?

Actually, I just looked up the licence agreement for the Hunglish corpus:

"1.2. User shall not publish, retransmit, display, redistribute, reproduce or commercially exploit the Data in any form, except that User may include limited excerpts from the Data in articles, reports and other documents describing the results of User’s linguistic education and research."

So I guess the answer to my question is no.


> destroys the value of the corpus for discourse analysis or convergence
> studies like Curran and Osborne 2002, but 95% of what we as computational
> linguists do is at or below the sentence level.

True.
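
For concreteness, the merge, sort and leave-a-portion-out procedure described above could be sketched roughly as below. This is only an illustration under stated assumptions, not the code used to build Hunglish: the function names, the naive regex sentence splitter, and the drop_fraction and seed parameters are all hypothetical, and the sketch handles monolingual text only (for an aligned corpus one would presumably sort sentence pairs instead).

```python
import random
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def scramble_corpus(documents, drop_fraction=0.1, seed=0):
    """Merge all documents, leave out a random portion, sort the rest.

    drop_fraction and seed are hypothetical parameters; the original
    post does not say how large the omitted portion was or how it
    was chosen.
    """
    rng = random.Random(seed)

    # Merge: pool the sentences of every document into one list.
    sentences = []
    for doc in documents:
        sentences.extend(split_sentences(doc))

    # Leave a portion out: each sentence is dropped with probability
    # drop_fraction, so the source documents cannot be fully rebuilt.
    kept = [s for s in sentences if rng.random() >= drop_fraction]

    # Sort alphabetically: this discards the original sentence order
    # and the document boundaries.
    return sorted(kept)


if __name__ == "__main__":
    docs = [
        "The dog barks. A cat sleeps.",
        "Birds sing! Zebras run.",
    ]
    for sentence in scramble_corpus(docs, drop_fraction=0.25, seed=1):
        print(sentence)
```

Sentence-level tools such as a POS tagger can still be trained on the output, since each sentence survives intact; only the discourse structure is lost.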

Fran


