A few years back, I compiled a massive corpus of Bibles and related texts in a CES-conformant XML format (following Resnik 1996), some of them with annotations. For the most part, distributing this corpus would be illegal under European copyright law (and that's why you haven't heard about it), but I realized that there are circumstances which could allow dissemination of a great part of it under an academic license.

Compiling and distributing a web corpus is basically illegal in Europe unless explicitly permitted by an accompanying license. However, US law has the concept of fair use, and if a data provider declares US legislation to apply (e.g., that "[t]hese Terms and Conditions ... are governed by the laws of the State of New York"), we Europeans can rely on the principle of fair use, as well.

According to 17 U.S.C. § 107, "the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright." The intended use is for NLP research, DH scholarship and classroom use, so that would probably not be an issue. In fact, there is no financial damage whatsoever, as this data is freely and redundantly available from the web.

However, am I allowed to distribute this corpus with an explicit license statement? I think CC-BY-NC should protect the intellectual and commercial interests of the creator of the electronic edition and be roughly in the spirit of an academic license. Of course, I'm not the actual owner of the data, only responsible for its transformation and annotation. I am wondering about the consequences if someone eventually creates an NLP tool chain from this data and uses models trained on it in a commercial application. As the original copyright extends to derived works, this would be a clear violation of my license statement, but I would be responsible, since I redistributed the data and, by transforming it from messy HTML to proper markup, actually enabled this violation.

