[Corpora-List] Fair use (US) and CC-BY-NC

Christian Chiarcos chiarcos at informatik.uni-frankfurt.de
Sat Apr 15 15:20:15 CEST 2017

Dear colleagues,

a few years back, I compiled a massive corpus of Bibles and related texts in a CES-conformant XML format (following Resnik 1996), some also with annotations. For the most part, distributing this corpus would be illegal under European copyright law (and that's why you haven't heard about it), but I realized that there are circumstances which could allow dissemination of a great part of it under an academic license.

Compiling and distributing a web corpus is basically illegal in Europe unless explicitly permitted by an accompanying license. However, US law has the concept of fair use, and if a data provider declares US legislation to apply (e.g., that "[t]hese Terms and Conditions ... are governed by the laws of the State of New York"), we Europeans can rely on the principle of fair use as well.

According to 17 U.S.C. § 107, "the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright." The intended use is for NLP research, DH scholarship and classroom use, so that would probably not be an issue -- and in fact, there is no financial damage whatsoever, as this data is freely and redundantly available from the web.

However, am I allowed to distribute this corpus with an explicit license statement? I think CC-BY-NC should protect the intellectual and commercial interests of the creator of the electronic edition and be roughly in the spirit of an academic license. Of course, I'm not the actual owner of the data, only responsible for its transformation and annotation. I am wondering about the consequences if someone eventually creates an NLP tool chain from this data and uses models trained on it in a commercial application. As the original copyright extends to derived works, this would of course be a clear violation of my license statement, but I would be responsible, as I redistributed the data and, by transforming it from messy HTML to proper markup, actually enabled this violation.

Looking forward to your opinion ;)

Best,
Christian
--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos at informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
