[Corpora-List] Fair use (US) and CC-BY-NC

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Sat Apr 22 17:51:48 CEST 2017


Since the discussion on this seems to be so silent, let me say a few words on this topic.

As we all know, the copyright situation with regard to corpora, and especially web corpora, is really very vague, and it differs a lot across different countries (even within the EU, though there are at least some shared principles).

Asking lawyers usually just reveals another well-known fact, namely that every single one of them will have his or her own different opinion. Most legal advice that you then get stays on the safe side: "you cannot do anything".

So you can either do that, or try to be practical about the issue. You say it is a corpus of Bibles: do you think any of the copyright holders might want to sue you? (why?)

What you are asking about is actually not one question, but very many. Perhaps the trickiest one is indeed the geographic application of the controlling law in the age of the internet. You are right that the two licensing schemes do not need to follow the same controlling law, though it is very questionable whether you, as a German entity in the position of a licensor, can use some other controlling law (in the case of a German court ruling, such a licence would first be checked for compatibility with German copyright law, and if it is not compatible, default copyright principles would apply instead).

Another question is the one about derived data: there are legal opinions saying that if you take a ready-made machine learning tool and train it on some data, the result is NOT subject to copyright law at all, as it was not produced by a human (and if you look at the Google SyntaxNet models, which are trained on the UD data with different licences, some of them even with an NC licence, this is exactly it: they do not say ANYTHING about a licence for the models). But to be honest, I would not worry about this part a lot...as long as God doesn't release some third testament and ask whether we people already have tools to PoS-tag it automatically ;) -- tools trained on Bibles are highly unlikely to be usable for anything else because of the specificity of that text type.

Best, Milos

Milos Jakubicek

CEO, Lexical Computing Brno, CZ | Brighton UK http://www.lexicalcomputing.com http://www.sketchengine.co.uk

On 18 April 2017 at 16:18, Christian Chiarcos <chiarcos at informatik.uni-frankfurt.de> wrote:


> On .04.2017, at 01:02, Patrick Juola <juola at mathcs.duq.edu> wrote:
>
> The CC-BY-NC does _not_ provide any statement about controlling law (read
> it at https://creativecommons.org/licenses/by-nc/4.0/legalcode) and so I
> don't think you will be able to use this loophole.
>
>
> Actually, the controlling law for assigning my own license should depend
> on my place of residence or that of my licensor. Accordingly, the localized
> versions of CC-BY-NC make reference to national laws (the international
> doesn't, of course), e.g. CC-BY-NC DE:
>
> "Sofern zwischen Ihnen und dem Lizenzgeber keine anderweitige Vereinbarung
> getroffen wurde und soweit Wahlfreiheit besteht, *findet auf diesen
> Lizenzvertrag das Recht der Bundesrepublik Deutschland Anwendung*."
> (https://creativecommons.org/licenses/by-nc/3.0/de/legalcode)
>
> i.e., "the law of the Federal Republic of Germany applies to this
> licence agreement"
>
> But actually, in my understanding, the controlling law regulating my
> original access to the data and the controlling law of me sub-licensing the
> data can be different, as these are two independent legal acts, the first
> involving me and the original data provider (under their law, by their
> request), the second between me and possible users of the corpus (neither
> of which needs to have a relation to the country of the data provider).
>
> Objections?
>
> Christian
>
>
> On Sat, Apr 15, 2017 at 9:20 AM, Christian Chiarcos <
> chiarcos at informatik.uni-frankfurt.de> wrote:
>
>> Dear colleagues,
>>
>> a few years back, I compiled a massive corpus of Bibles and related texts
>> in a CES-conformant XML format (following Resnik 1996), some also with
>> annotations. For the most part, distributing this corpus would be illegal
>> under European copyright law (and that's why you haven't heard about it),
>> but I realized that there are circumstances which could allow dissemination
>> of a great part of it under an academic license.
>>
>> Compiling and distributing a web corpus is basically illegal in Europe
>> unless explicitly permitted by an accompanying license. However, US law has
>> the concept of fair use, and if a data provider declares US legislation to
>> apply (e.g., that "[t]hese Terms and Conditions ... are governed by the
>> laws of the State of New York"), we Europeans can rely on the principle of
>> fair use, as well.
>>
>> According to 17 U.S.C. § 107, "the fair use of a copyrighted work,
>> including such use by reproduction in copies or phonorecords or by any
>> other means specified by that section, for purposes such as criticism,
>> comment, news reporting, teaching (including multiple copies for classroom
>> use), scholarship, or research, is not an infringement of copyright." The
>> intended use is for NLP research, DH scholarship and classroom use, so that
>> would probably not be an issue -- and in fact, there is no financial damage
>> whatsoever as this data is freely and redundantly available from the web.
>>
>> However, am I allowed to distribute this corpus with an explicit license
>> statement? I think CC-BY-NC should protect the intellectual and commercial
>> interests of the creator of the electronic edition and be roughly in the
>> spirit of an academic license, but of course, I'm not the actual owner of
>> the data, but only responsible for its transformation and annotation. I am
>> wondering about the consequences if someone eventually creates an NLP tool
>> chain from this data and uses any models trained on the data in a
>> commercial application. As the original copyright extends to derived works,
>> this would be a clear violation of my license statement, of course, but I
>> would be responsible as I redistributed the data and by transforming it
>> from messy HTML to proper markup, I actually enabled this violation.
>>
>> Looking forward to your opinion ;)
>>
>> Best,
>> Christian
>> --
>> Prof. Dr. Christian Chiarcos
>> Applied Computational Linguistics
>> Johann Wolfgang Goethe Universität Frankfurt a. M.
>> 60054 Frankfurt am Main, Germany
>>
>> office: Robert-Mayer-Str. 10, #401b
>> mail: chiarcos at informatik.uni-frankfurt.de
>> web: http://acoli.cs.uni-frankfurt.de
>> tel: +49-(0)69-798-22463
>> fax: +49-(0)69-798-28931
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>