[Corpora-List] Google Books, copyrights, and corpora

Eric Atwell eric at comp.leeds.ac.uk
Wed Jun 14 18:21:01 CEST 2006


Mark,
If I understand correctly, you want to develop a free online concordancer
for a large copyrighted text corpus. Rather than guess at the legality,
I suggest you contact Google Research directly with your proposal
and seek their support. They might even award you a research grant

:-)

but even if not, it sounds like your project could be "part of" the
Google Books initiative, so they could deal with any legal issues,
allowing you to sidestep the legal niceties and focus on building a
useful research resource.

Good luck!

Eric Atwell, Leeds University

On Wed, 14 Jun 2006, Mark Davies wrote:


> Most of us are familiar with the Google Books initiative -- the project that will digitize tens of millions of books from several leading libraries (http://books.google.com/intl/en/googlebooks/about.html). Google scans these books and then makes them searchable for end users via the Web.

>

> For copyrighted works, the end users see only a "snippet" view -- similar to what we linguists would call an entry in a KWIC display. This is the line of text containing the word or phrase searched for, and maybe one line of text before and one after.

>

> Google claims that although the entire text is (indexed) on the server, the end user sees only very limited context, and there is therefore no violation of US Fair Use Law. See http://books.google.com/googlebooks/newsviews/legal.html for their legal claims and http://fairuse.stanford.edu/ for US Fair Use Law.

>

> In 2005 Google was sued by the Association of American Publishers, which claimed that the "snippet defense" is not adequate in this case (see http://publishers.org/press/releases.cfm?PressReleaseArticleID=292). The case is still in litigation.

>

> ---

>

> What are the implications of this for corpus creation and use? If Google wins, does it mean that we can include *ANY* texts in a corpus, as long as the end user only has access to short KWIC entries (especially if the search interface prevents them from "chaining" these together to re-create larger strings of text)? I guess I'm interested in this question right now, as I'm considering the legal implications of using a particular text collection (300+ million words) as part of a historical corpus of English.

>

> In the past, we've discussed copyright and we've discussed Google and we've discussed Google copyright issues (see several CORPORA posts in June 2003 relating to cached web pages). But this discussion was before Google announced the Google Books initiative, and before they announced the "snippet defense", which seems to have clear application to what we're doing (or could do) with corpora.

>

> Any comments?

>

> =================================================

> Mark Davies

> Assoc. Prof., Linguistics

> Brigham Young University

> (phone) 801-422-9168 / (fax) 801-422-0906

> http://davies-linguistics.byu.edu

>

> ** Corpus design and use // Linguistic databases **

> ** Historical linguistics // Language variation **

> ** English, Spanish, and Portuguese **

> =================================================

>

>

>


--
Eric Atwell, Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-3435430 FAX: +44-113-3435468 http://www.comp.leeds.ac.uk/eric




