[Corpora-List] POS-tagger maintenance and improvement

Andras Kornai andras at kornai.com
Thu Feb 26 18:45:20 CET 2009

On Thu, Feb 26, 2009 at 08:29:53AM +0100, Francis Tyers wrote:
> > assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01)
> > which is chock-full of copyrighted text, by the simple expedient of merging
> > all text files in one and alphabetically sorting the sentences leaving a
> > portion out, which makes it more labor-intensive to restore the original
> > documents than keying them in. This method (blessed by the UPenn lawyers)
> Actually, I was thinking of doing something similar, but was led to
> believe that the text was still copyrighted... even if it was sorted and
> thus couldn't be distributed under a free licence -- for example the BSD
> or LGPL.
> Do you have by any chance a written statement from the UPenn lawyers
> regarding this?

They haven't provided a separate written statement (nor have we asked for one), but they did explain their reasoning. Let R be the copyright holder of some work, B be a potential buyer, and M be the maker of the corpus. The prime mover behind copyright cases is economic harm. As long as M sells copyrighted material, or even gives it away, M takes away B's reason to buy from the source that would pay royalties to R, so M is causing economic harm. With the sorted corpus it is clear that no such harm is done, since the users of the corpus have not actually gained access to the copyrighted work, and the corpus can't be exploited for pirate editions.
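
For concreteness, the merge-and-sort procedure quoted above might be sketched roughly as follows. This is only an illustration, not the code used to build Hunglish: the function name `scramble_aligned`, the `drop_fraction` and `seed` parameters, and the representation of each document as a list of (source, target) sentence pairs are all assumptions made for the example.

```python
import random


def scramble_aligned(documents, drop_fraction=0.1, seed=0):
    """Pool aligned sentence pairs, leave a portion out, and sort the rest.

    documents: list of documents, each a list of (source, target) pairs.
    drop_fraction: hypothetical share of pairs to leave out (an assumption;
        the original post does not say how large the omitted portion is).
    """
    rng = random.Random(seed)

    # Merge: pool the pairs of all documents, discarding document boundaries.
    pooled = [pair for doc in documents for pair in doc]

    # Leave a portion out, so no document can be restored in full.
    kept = [pair for pair in pooled if rng.random() >= drop_fraction]

    # Sort alphabetically (by source sentence, then target). Python's default
    # string order is by code point, not locale-aware Hungarian collation.
    return sorted(kept)


if __name__ == "__main__":
    docs = [
        [("Jó reggelt.", "Good morning."), ("Hogy vagy?", "How are you?")],
        [("Köszönöm.", "Thank you.")],
    ]
    for hu, en in scramble_aligned(docs, drop_fraction=0.0):
        print(hu, "\t", en)
```

Because each pair is sorted as a unit, the sentence alignment survives while the original document order does not.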

> Actually, I just looked up the licence agreement for the Hunglish
> corpus:
> "1.2. User shall not publish, retransmit, display, redistribute,
> reproduce or commercially exploit the Data in any form, except that User
> may include limited excerpts from the Data in articles, reports and
> other documents describing the results of User’s linguistic education
> and research. "
> So I guess the answer to my question is no.

This is the generic LDC policy, and again it doesn't enjoin you from the main goal you'd want to use a corpus for, namely training and testing computational linguistic models. Whether using the trained system in a for-profit system would be infringing I'm not sure, IANAL. But the world is full of systems that were optimized on LDC corpora, probably because these works, from an economic standpoint, do not harm the copyright holders. From a legal standpoint I'm not sure; this may even depend on the laws of the country you are in, but in a large corpus the impact of any single work on training is so minimal that "de minimis non curat lex" is probably applicable.

So the WSJ could possibly come after you if you used a model trained only on the WSJ in a commercial system (I say possibly since you still have the "transformative use" defense), but why would you ever want to do such a thing? A pure WSJ model already shows signs of strain on the NYT, and if your goal is a system that works on journalistic prose you are far better off training it on a broad mixture of newspaper sources. If, on the other hand, your goal is to do something value-added specifically for WSJ readers, you should be getting the opinion of WSJ lawyers anyway.

Andras Kornai, NAL

PS. In the hope of steering the conversation back to Adam's original point, let me say here that even if one were inclined to dispute the statement that the use of some copyrighted work is de minimis, surely corrections to this work are de minimis!
