[Corpora-List] Google Books, copyrights, and corpora

Mark Davies Mark_Davies at byu.edu
Wed Jun 14 17:22:00 CEST 2006

Most of us are familiar with the Google Books initiative -- the project that will digitize tens of millions of books from several leading libraries (http://books.google.com/intl/en/googlebooks/about.html). Google scans these books and then makes them searchable for end users via the Web.

For copyrighted works, the end users see only a "snippet" view -- similar to what we linguists would call an entry in a KWIC display. This is the line of text containing the word or phrase searched for, and maybe one line of text before and one after.
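For anyone less familiar with KWIC displays, here is a minimal Python sketch of the idea. It is only an illustration and not how Google builds its snippets; the function name `kwic` and the `width` parameter are hypothetical, and context is measured in characters rather than lines.

```python
import re

def kwic(text, query, width=40):
    """Return (left context, hit, right context) for each match of `query`.

    Hypothetical helper for illustration: `width` caps the number of
    characters of context shown on each side of the hit.
    """
    snippets = []
    for m in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        snippets.append((left, m.group(0), right))
    return snippets
```

The whole text has to be available to the search code, but the user only ever receives the short windows around each hit.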

Google claims that although the entire text is indexed on the server, the end user sees only very limited context, and that there is therefore no violation of US fair use law. See http://books.google.com/googlebooks/newsviews/legal.html for their legal claims and http://fairuse.stanford.edu/ for US fair use law.

In 2005 Google was sued by the Association of American Publishers, which claimed that the "snippet defense" is not adequate in this case (see http://publishers.org/press/releases.cfm?PressReleaseArticleID=292). The case is still in litigation.


What are the implications of this for corpus creation and use? If Google wins, does it mean that we can include *ANY* texts in a corpus, as long as the end user only has access to short KWIC entries (especially if the search interface prevents them from "chaining" these together to re-create larger strings of text)? I guess I'm interested in this question right now, as I'm considering the legal implications of using a particular text collection (300+ million words) as part of a historical corpus of English.

In the past, we've discussed copyright and we've discussed Google and we've discussed Google copyright issues (see several CORPORA posts in June 2003 relating to cached web pages). But this discussion was before Google announced the Google Books initiative, and before they announced the "snippet defense", which seems to have clear application to what we're doing (or could do) with corpora.

Any comments?

Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
