[Corpora-List] Google Books, copyrights, and corpora

Nathan Bauman n.bauman at utoronto.ca
Wed Jun 14 17:57:00 CEST 2006

I'd be interested in hearing how Google is going to stop people from
recreating texts. My gut feeling is that Google is in the wrong on this

An anecdote: My old professor of Religious Studies, Martin Abegg, used
precisely such a concordance to piece together the corpus of Dead Sea
Scrolls for his Ph.D. dissertation. A private paper concordance had been
produced by the team in charge of publishing the scrolls; a few copies of
that concordance were lent to various institutions. The one that he used
was freely available in the stacks of the library at Hebrew Union College.
I remember him telling us that he used the concordance to piece together
the texts because he needed just one text, an unpublished one, for his
dissertation. After he had assembled the entire corpus of texts
known at that time, he was strongly encouraged by various people to publish
all of them, which he eventually did. He was sued, if memory serves, in
both an Israeli court and an American one, but I cannot recall
the outcome of either case. (Eventually, things worked out for him, as he
ended up compiling the index volume to the official publication series some
years later. A young undergrad, I was paid to check the English
transliteration of names for the volume.) Anyway, good luck--and be
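
To make the "chaining" idea concrete, here is a minimal Python sketch of how
overlapping concordance entries can be stitched back into running text. It
rests on loud assumptions: every word has an entry, each entry carries a
fixed number of context words on both sides, and no (left context, keyword)
pair occurs twice. The names kwic_entries and reconstruct are hypothetical,
and nothing here describes how Google's interface or the scrolls concordance
actually works.

```python
def kwic_entries(tokens, width=2):
    """Build one KWIC entry per token: (left context, keyword, right context)."""
    entries = []
    for i, word in enumerate(tokens):
        left = tuple(tokens[max(0, i - width):i])
        right = tuple(tokens[i + 1:i + 1 + width])
        entries.append((left, word, right))
    return entries


def reconstruct(entries):
    """Chain KWIC entries back into the original token sequence.

    Assumes width >= 1 and that each (left context, keyword) pair is
    unique; with repeated contexts the chain would become ambiguous.
    """
    width = max(len(left) for left, _, _ in entries)
    # Look up an entry's right context by its left context and keyword.
    index = {(left, word): right for left, word, right in entries}
    # The first word of the text is the only entry with no left context.
    left, word, right = next(e for e in entries if not e[0])
    text = [word]
    while right:
        # The next keyword is the first word of the current right context;
        # its left context is the current left context shifted by one word.
        next_left = (left + (word,))[-width:]
        next_word = right[0]
        left, word = next_left, next_word
        right = index[(left, word)]
        text.append(word)
    return text


if __name__ == "__main__":
    sample = "in the beginning the scribe copied the scroll and the scribe rested"
    tokens = sample.split()
    # Sort by keyword, as a concordance would, so the original order is lost.
    entries = sorted(kwic_entries(tokens, width=2), key=lambda e: e[1])
    print(" ".join(reconstruct(entries)))
```

With narrower contexts or highly repetitive text the uniqueness assumption
fails more often, which is presumably the kind of gap a search interface
would rely on to prevent chaining.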

Nathan Bauman
General English Program,
Sookmyung Women's University
Seoul, South Korea

----- Original Message -----
From: "Mark Davies" <Mark_Davies at byu.edu>
To: <corpora at hd.uib.no>
Sent: Thursday, June 15, 2006 12:18 AM
Subject: [Corpora-List] Google Books, copyrights, and corpora

> Most of us are familiar with the Google Books initiative -- the project
> that will digitize tens of millions of books from several leading
> libraries (http://books.google.com/intl/en/googlebooks/about.html). Google
> scans these books and then makes them searchable for end users via the
> Web.
>
> For copyrighted works, the end users see only a "snippet" view -- similar
> to what we linguists would call an entry in a KWIC display. This is the
> line of text containing the word or phrase searched for, and maybe one
> line of text before and one after.
>
> Google claims that although the entire text is (indexed) on the server,
> the end user sees only very limited context, and there is therefore no
> violation of US Fair Use Law. See
> http://books.google.com/googlebooks/newsviews/legal.html for their legal
> claims and http://fairuse.stanford.edu/ for US Fair Use Law.
>
> In 2005 Google was sued by the Association of American Publishers, which
> claimed that the "snippet defense" is not adequate in this case (see
> http://publishers.org/press/releases.cfm?PressReleaseArticleID=292). The
> case is still in litigation.
>
> ---
>
> What are the implications of this for corpus creation and use? If Google
> wins, does it mean that we can include *ANY* texts in a corpus, as long as
> the end user only has access to short KWIC entries (especially if the
> search interface prevents them from "chaining" these together to re-create
> larger strings of text)? I guess I'm interested in this question right
> now, as I'm considering the legal implications of using a particular text
> collection (300+ million words) as part of a historical corpus of English.
>
> In the past, we've discussed copyright and we've discussed Google and
> we've discussed Google copyright issues (see several CORPORA posts in June
> 2003 relating to cached web pages). But this discussion was before Google
> announced the Google Books initiative, and before they announced the
> "snippet defense", which seems to have clear application to what we're
> doing (or could do) with corpora.
>
> Any comments?
>
> =================================================
> Mark Davies
> Assoc. Prof., Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> =================================================
