[Corpora-List] Query on the use of Google for corpus research
Mark P. Line
mark at polymathix.com
Tue May 31 20:38:00 CEST 2005
Tom Emerson said:
> Mark P. Line writes:
>> I hope that nobody unleashes impolite crawlers anywhere. That's why I
>> noted that there is a protocol that should be honored, so that crawlers
>> behave the way the webmasters wish.
> Following the robots exclusion protocol is only part of the issue,
> though. You also have to be sure you don't hit the site with tens (or
> hundreds) of requests per second (as one example.)
Yes. That's why I said "there's a protocol that should be honored", not
"you should honor the robot exclusion protocol and be done with it". The
protocol followed by polite crawlers includes following the robot
exclusion protocol, avoiding too many requests per second to a single
server, and whatever else the International Brotherhood of Webmasters is
whining about this week. :)
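
To make the two parts of that protocol concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not anyone's actual crawler: the class name `PolitenessGate`, the user-agent string and the two-second delay are all hypothetical choices, and the robots.txt text is assumed to have been fetched elsewhere.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "corpus-sketch-bot"  # hypothetical crawler name
MIN_DELAY = 2.0                   # assumed minimum seconds between hits on one host


class PolitenessGate:
    """Tracks robots.txt rules and the time of the last request, per host."""

    def __init__(self, user_agent=USER_AGENT, min_delay=MIN_DELAY):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self.rules = {}     # host -> RobotFileParser
        self.last_hit = {}  # host -> time of the last request

    def load_rules(self, host, robots_txt):
        """Parse the text of a host's robots.txt (fetched elsewhere)."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if rules for the host are loaded and permit this URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        # No rules loaded yet: be conservative and refuse.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url, now=None):
        """How long to sleep before the next request to this URL's host."""
        now = time.monotonic() if now is None else now
        last = self.last_hit.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_hit(self, url, now=None):
        """Remember that a request to this URL's host was just made."""
        self.last_hit[urlsplit(url).netloc] = time.monotonic() if now is None else now
```

A fetch loop would call `allowed(url)`, sleep for `seconds_to_wait(url)`, make the request, and then call `record_hit(url)`.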
But none of this is new, and none of it is going to be much of a problem
for a researcher who merely wants to capture some sample texts off the
>> Have you ever harvested a linguistic research corpus from the web
>> with that many seed URL's? Why?
> Yes. The work that we're doing in named entity extraction and
> automatic lexicon construction requires gigabytes of data, classified
> by language and in as many genres as possible.
And you believe that's typical for linguists wishing to capture a research
corpus from the Web?
> I have an ongoing crawl
> started from 2300+ seeds where I've so far collected 193 GB of raw
> HTML data, representing just under 9 million documents. The crawl has
> discovered some 21.7 million documents and continues to run.
Yes. Your company develops and markets language-savvy tools that profit
from your named entity extraction and automatically constructed lexica.
Again, do you believe that's typical for linguists wishing to capture a
research corpus from the Web?
>> What linguistic questions am I looking to answer with my corpus? Is it
>> better if I get less text from more sites or more text from fewer sites?
>> How many seeds did I really start with? Am I following off-site links?
> Exactly: and these questions mean that you need a highly configurable
> crawler that is scalable to thousands or tens of thousands of URLs.
No, they don't mean that. They mean that I need the ability to throw
together (in current buzzword parlance: to agilely construct) a crawler
between now and dinner time that will serve to acquire the corpus I want
to try to gather tonight. If I don't like the results when I see them
tomorrow morning, then I'll tweak and restart.
This may not sound very professional or scalable or high-performance to
you, but it's very close to real life in the Humanities. Most corpus
linguists are not interested in marketable product development; they're
interested in answering research questions about language.
(You'll note that the subject line of this thread still says something
about "corpus research". I didn't think this was ever about
high-performance product development.)
>> Maybe I would keep a list of the pages I'd already seen, and check the
>> list before I requested a page. :)
>> (That might not be a scalable solution for all purposes, but it works
>> at the scale of corpus harvesting.)
> And what scale is that? The space required to track tens or hundreds
> of millions of URLs is significant.
It would be an insignificant burden on leonardo (my Linux machine) to
track hundreds of millions of URL's if I wanted to. In earlier work, I've
designed mechanisms for capturing and processing satellite imagery at the
rate of terabytes PER DAY, and I didn't have nearly the sophisticated
tools at my disposal then that I have under my desk right now.
But, at the risk of repeating myself, that is simply *not* the scale at
which linguistic researchers are typically going to want to operate when
harvesting corpus material from the Web.
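
The "list of pages I'd already seen" idea quoted above can be sketched in a few lines of Python. This is a hedged illustration, not the mechanism anyone in this thread actually used: the names `normalize` and `SeenURLs` are hypothetical, the normalization rules are an assumption, and storing a truncated 8-byte digest trades a very small chance of false "already seen" answers for lower memory use.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, treat an empty path as '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class SeenURLs:
    """Remembers which URLs have been requested, as short hash digests."""

    def __init__(self):
        self._digests = set()

    def _key(self, url):
        # Assumption: 8 bytes of SHA-1 is enough at corpus-harvesting scale.
        return hashlib.sha1(normalize(url).encode("utf-8")).digest()[:8]

    def add_if_new(self, url):
        """Return True (and remember the URL) if it has not been seen before."""
        key = self._key(url)
        if key in self._digests:
            return False
        self._digests.add(key)
        return True

    def __len__(self):
        return len(self._digests)
```

The crawler would call `add_if_new(url)` before each request and skip the URL whenever it returns False.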
>> > Or what if you want to recrawl certain sites with some regularity?
>> Why would I want to do that when my task is to construct a research
>> corpus? Even if I did, it's not exactly rocket surgery. :)
> Because you may be building a synchronic corpus.
I guess I'm going to have to get you to connect the dots for me. How does
revisiting sites with some regularity help me to build a synchronic corpus
in a way that I cannot build it if I never revisit any site again?
Or did you mean a _diachronic_ corpus, in the belief that processes of
language change can usefully be detected by means of periodic scans of
>> > What about sites that require login or cookies?
>> Why would I worry about those sites when I'm just looking to put
>> together some sample texts for linguistic research?
> Because you may be building your corpus from sites that require
> registration (think the New York Times, assuming you ignore their
Why would I ignore their robot exclusion rules? This assumption surprises
me, since you have expressed concern that readers of this thread might be
encouraged to do things that webmasters might not like.
In any event, building a corpus from sites that require registration is not
a typical need for linguists who just want to gather sample texts for
specific research questions. For most research questions that can be
supported by web-served HTML and plaintext, there's plenty of material out
there without worrying about sites that require registration. That's why I
wrote, "Why would I worry about those sites when I'm just looking to put
together some sample texts for linguistic research?".
>> > How do you schedule the URLs to be crawled?
>> Why would I schedule them if all I'm doing is harvesting corpus texts?
> Because starting with your seeds you will discover many more URLs than
> you can crawl at any one time.
My point has been that I will not generally *need* more URL's than I can
crawl at any one time. I'm not updating the Google index. I'm not
acquiring named entities for an exhaustive lexical database or ontology.
I'm just collecting enough text to answer certain research questions about
my target language.
> Oh, and don't forget that you need to filter the content so that you
> don't download the latest batch of Linux ISOs because some idiot web
> master gave it a mime-type of text/plain. Or, perhaps more
> realistically, so you don't download PDF or Word files (unless you are
> wanting to deal with these.) And filtering on file name regexps (e.g.,
> "\.html?") does not always work, since many sites that may be of
> interest (think message boards) generate content from CGI scripts
> and don't have suffixes.
Filtering on MIME types has always worked for me. Because I'd be building
the crawler in agile fashion (not in a heavier
Analyze-Specify-Design-Build-Test-Deploy-Maintain fashion), I would deal
with filtering out things like mis-typed binary images if and when I
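
For illustration, filtering on the declared MIME type, plus a crude guard against binary data mislabelled as text/plain, might look like the Python sketch below. Everything here is an assumption made for the example: the set of wanted types, the function names, and the NUL-byte heuristic (which would also reject legitimate UTF-16 text).

```python
WANTED_TYPES = {"text/html", "text/plain"}  # assumed set of types worth keeping


def media_type(content_type_header):
    """Extract the bare media type from a Content-Type header value."""
    return content_type_header.split(";", 1)[0].strip().lower()


def looks_binary(first_bytes):
    """Crude heuristic: NUL bytes rarely occur in 8-bit text encodings."""
    return b"\x00" in first_bytes


def keep_response(content_type_header, first_bytes):
    """Keep a response only if its declared type is wanted and its start looks like text."""
    return (media_type(content_type_header) in WANTED_TYPES
            and not looks_binary(first_bytes))
```

A crawler could call `keep_response` on the header and the first block of the body, and abandon the download when it returns False.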
>> _Storing_ the volumes of data that is typical and adequate for corpus
>> linguistics would not be any more difficult when the data is coming from
>> the Web than when it is coming from anywhere else. It's _getting_ the
>> data that is different.
> Except we're talking about millions of small files. Few file systems
> handle this well, on any OS.
Why in the world would I store corpus text as millions of small files,
even if I were operating at such a large scale (which, again, again, is
not the typical case I've been advising for here)?
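
One possible alternative to millions of small files, sketched here in Python purely as an illustration, is to append every harvested document to a single file, one JSON record per line. The record layout (`url`, `text`) and the function names are hypothetical; the message above does not say which storage format its author has in mind.

```python
import json


def append_document(path, url, text):
    """Append one harvested document to the corpus file as a single JSON line."""
    record = {"url": url, "text": text}
    with open(path, "a", encoding="utf-8") as out:
        # json.dumps escapes newlines in the text, so each record stays on one line.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_documents(path):
    """Yield the stored documents back, one dict per line."""
    with open(path, encoding="utf-8") as src:
        for line in src:
            yield json.loads(line)
```

The file system then sees one growing file rather than one file per page, and the corpus can be streamed back in harvest order.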
I think we're starting to see the outlines of a paradigm divide here. :)
>> I do know for a fact, however, that corpus linguists do not need
>> high-performance crawlers in order to construct very useful research
>> corpora from the Web.
> Amen. But there is more than just the crawler. Post-processing the
> data is very resource intensive.
Yes. But unlike high-tech companies with a constant concern for
performance and time-to-market, most corpus linguists can happily allow
their Linux machine to crank a dataset for weeks if that's what it takes
(and if they've done enough prior exploration to be reasonably confident
that the process is working correctly).
Many are happy to have gotten the grant money to acquire anything more
than an office computer in the first place.
Mark P. Line
San Antonio, TX