[Corpora-List] POS-tagger maintenance and improvement

Jimmy O'Regan joregan at gmail.com
Thu Feb 26 23:36:51 CET 2009

2009/2/26 Francis Tyers <ftyers at prompsit.com>:
> On Thu, 26-02-2009 at 17:45 +0000, Andras Kornai wrote:
>> On Thu, Feb 26, 2009 at 08:29:53AM +0100, Francis Tyers wrote:
>> > > assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01)
>> > > which is chock-full of copyrighted text, by the simple expedient of merging
>> > > all text files in one and alphabetically sorting the sentences leaving a
>> > > portion out, which makes it more labor-intensive to restore the original
>> > > documents than keying them in.  This method (blessed by the UPenn lawyers)
>> >
>> > Actually, I was thinking of doing something similar, but was led to
>> > believe that the text was still copyrighted... even if it was sorted and
>> > thus couldn't be distributed under a free licence -- for example the BSD
>> > or LGPL.
>> >
>> > Do you have by any chance a written statement from the UPenn lawyers
>> > regarding this?
>> They haven't provided a separate written statement (nor have we asked for
>> one) but they did explain their reasoning. Let R be the copyright holder of
>> some work, B be a potential buyer, and M be the maker of the corpus.  The
>> prime mover behind copyright cases is economic harm. As long as M sells
>> copyrighted material, or even gives it away, M is taking away the reason of
>> B to buy from the source that would pay royalties to R, so M is causing
>> economic harm. Here it is clear that no harm is done, since the users of
>> your corpus have not actually gained access to the copyrighted work and the
>> corpus can't be exploited for pirate editions.

I think you'll find that their statement was carefully worded to merely portray the issues in the area without giving any direct, specific advice. Legal analyses of this kind are quite often given to law students to perform; they are not lawyers, and cannot give legal advice.

You focus only on the 'economic harm' aspect; you should also consider that, if any of the publishers also produce corpora, or if any of the translators sell their translation memories, then they have a very real case that you are causing them economic harm.

Economic harm is far from the only factor in copyright; at best, you simply won't be held liable to pay a large amount in damages. Who wants to put the work into compiling a corpus, only to be hit with a cease and desist notice?
