[Corpora-List] copyright issues

Christian Chiarcos christian.chiarcos at web.de
Fri Feb 27 13:21:00 CET 2009

Dear Andreas Kornai,

> getting hit with all kinds of notices, being held liable for vast
> amounts of damages, and in the end getting tarred, feathered, and ran
> out of town. I would be interested in hearing about any such cases.

Talking about newspaper corpora, such things actually happen, but they normally seem to be settled quietly once the corpus distributors receive an official written warning. The problem is that, since the outsourcing boom of the 1990s, the rights to use published newspaper articles have often been handed over to specialized content redistribution companies. These companies are quite sensitive about copyright, as it is fundamental to their business model. Examples of German content redistribution companies are http://www.pressemonitor.de and http://www.vgwort.de.

I know of at least one such case, where a corpus was built and made accessible without written permission. I think they even got oral confirmation that they could use the data when they started their work, but later the people responsible at the publishing house could not recall it, perhaps because responsibilities had shifted during an internal reorganization. So, years later, they were confronted with a huge compensation claim (and, I think, publication restrictions as well).

On another occasion (different people, same publisher), the publisher was contacted in advance. They explicitly allowed the creation of a corpus, but only for the duration of the project. So this corpus (which already exists) may be neither redistributed nor even stored beyond the end of the project. However, as the corpus will receive only partial annotations, this is not so problematic: only the annotated parts will be made available in the end, and in total these cover less than 15% of the original text. According to (our interpretation of) German copyright law, this is comparable to illustrative examples such as those quoted in scientific papers and thus legally unproblematic (if the analogy holds).

> sees scholars sued for publishing
> their corpus, the risk seems to be bearable.

The problem is that we cannot know what the business models of the future will look like. So, if one day someone in management gets the (possibly mistaken) impression that this data has become economically relevant, their lawyers will certainly find you. This is what actually happened to the people mentioned above.

The problem is even worse, because it is not entirely clear what counts as a derived work (annotations? statistical models trained on them?), or to what degree the copyright owner of the original text also holds a copyright on the derived work. If the copyright status of the corpus data is problematic, then derived works may be problematic as well.

For this reason alone, it is safer to ask the publisher for a written agreement stating explicitly what you are allowed to do with the data. The only legal alternative is to restrict your corpora to illustrative examples, i.e., to use at most a fraction (e.g., <=15% per document as a rule of thumb) of the original text. But even this practice does not guarantee full legal security until it has been confirmed by a court ruling.

Best, Christian Chiarcos

