[Corpora-List] POS-tagger maintenance and improvement

Francis Tyers ftyers at prompsit.com
Thu Feb 26 21:33:45 CET 2009

On Thu, 26-02-2009 at 17:45 +0000, Andras Kornai wrote:
> On Thu, Feb 26, 2009 at 08:29:53AM +0100, Francis Tyers wrote:
> > > assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01)
> > > which is chock-full of copyrighted text, by the simple expedient of merging
> > > all text files in one and alphabetically sorting the sentences leaving a
> > > portion out, which makes it more labor-intensive to restore the original
> > > documents than keying them in. This method (blessed by the UPenn lawyers)
> >
> > Actually, I was thinking of doing something similar, but was led to
> > believe that the text was still copyrighted... even if it was sorted and
> > thus couldn't be distributed under a free licence -- for example the BSD
> > or LGPL.
> >
> > Do you have by any chance a written statement from the UPenn lawyers
> > regarding this?
> They haven't provided a separate written statement (nor have we asked for
> one) but they did explain their reasoning. Let R be the copyright holder of
> some work, B be a potential buyer, and M be the maker of the corpus. The
> prime mover behind copyright cases is economic harm. As long as M sells
> copyrighted material, or even gives it away, M is taking away B's reason to
> buy from the source that would pay royalties to R, so M is causing
> economic harm. Here it is clear that no harm is done, since the users of
> your corpus have not actually gained access to the copyrighted work and the
> corpus can't be exploited for pirate editions.

This is good reasoning for fair use (which exists in some countries), but see below.

> > Actually, I just looked up the licence agreement for the Hunglish
> > corpus:
> >
> > "1.2. User shall not publish, retransmit, display, redistribute,
> > reproduce or commercially exploit the Data in any form, except that User
> > may include limited excerpts from the Data in articles, reports and
> > other documents describing the results of User’s linguistic education
> > and research. "
> >
> > So I guess the answer to my question is no.
> This is the generic LDC policy,

Which is a standard restrictive "non-commercial, research use only" deal.

> and again it doesn't enjoin you from the
> main goal you'd want to use a corpus for, namely training and testing
> computational linguistic models. Whether using the trained system in a
> for-profit system would be infringing I'm not sure, IANAL. But the world is
> full of systems that were optimized on LDC corpora, probably because these
> works, from an economic standpoint, do not harm the copyright holders. From
> a legal standpoint I'm not sure, this may even depend on the laws of the
> country you are in, but in a large corpus the impact of any single work on
> training is so minimal that "de minimis non curat lex" is probably applicable.

It does not allow derivative works. So, for example, if I wanted to take the corpus and add some fancy new markup to it, I could not redistribute it[1] under a free software licence (BSD, LGPL, GPL, ...) for others to benefit.


1. For example, by putting it in a public revision control system.
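
For readers who want to see the sorting approach quoted at the top of this message in concrete terms (merge all text files, leave a portion out, sort the sentences alphabetically), here is a minimal Python sketch. It is not the actual Hunglish build procedure: the function name `scramble_corpus`, the `drop_fraction` and `seed` parameters, and the choice to drop sentences at random are all assumptions made for illustration.

```python
import random


def scramble_corpus(documents, drop_fraction=0.1, seed=0):
    """Merge documents, leave a portion of sentences out, and sort the rest.

    `documents` is a list of documents, each a list of sentences (strings).
    All names and defaults here are hypothetical, not from the thread.
    """
    rng = random.Random(seed)
    # Merge: pool every sentence from every document into one list.
    pooled = [sentence for doc in documents for sentence in doc]
    # Leave a portion out so that no document can be restored in full.
    kept = [s for s in pooled if rng.random() >= drop_fraction]
    # Sort alphabetically, which discards the original document order.
    return sorted(kept)
```

For an aligned corpus, each item could be a (source, target) tuple instead of a single string; `sorted` would then order the pairs by the source sentence first.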
