[Corpora-List] Query on the use of Google for corpus research

Mark P. Line mark at polymathix.com
Tue May 31 23:04:00 CEST 2005

Marco Baroni said:


> Consider the following spider trap (quoted in the heritrix
> documentation):
>
> http://spidrs.must.dye.notttttttt/ [obfuscated]

So, you've just inserted a link to a spider trap into the Corpora-List

> [snip]
>
> Perhaps, there are other heuristics that are simpler and/or better (any
> suggestion?), but in any case this means that you have to add yet another
> module to your corpus-crawling/processing architecture,
> and if you happen to download a few gigabytes of data from sites like the
> one above things can get really annoying...

If you've received grant money for a proposal in which you made your
entire program of research dependent on the availability of corpus texts
acquired from <spidrs.must.dye.notttttttttt>, then I guess you might have
painted yourself into a corner.

Fortunately, that's seldom going to be the case in real-life corpus
research. If you can't get text from one site, you'll get it from another.

There are lots of possible heuristics that work just fine if all you're
doing is collecting some sample texts for a research corpus, such as
limiting the amount of time you spend harvesting from any given website.
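
A per-site time limit of that kind can be sketched in a few lines of Python. This is only an illustration, not anybody's actual crawler: `harvest_site`, `fetch` and `budget_seconds` are hypothetical names, and `fetch` stands for whatever download routine you already have.

```python
import time
from typing import Callable, Iterable, List


def harvest_site(urls: Iterable[str],
                 fetch: Callable[[str], str],
                 budget_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Fetch pages from one site until the time budget is used up.

    `fetch` is a caller-supplied (hypothetical) function that returns the
    text of a URL; `clock` is injectable so the limit can be tested.
    """
    deadline = clock() + budget_seconds
    texts: List[str] = []
    for url in urls:
        if clock() >= deadline:
            break  # budget spent: abandon this site and move on to the next
        texts.append(fetch(url))
    return texts
```

Because the loop stops on elapsed time rather than on running out of links, a trap that generates links forever costs you at most one budget's worth of harvesting (plus whatever the last fetch takes, since the clock is only checked between fetches).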

Another processing technique that may perhaps never occur to somebody in
the web-mining/indexing industry would be for the researcher to actually
eyeball the texts that come in to see if the sampling procedure needs to
be enhanced.

Of course there is a high-powered product development industry out there
that couldn't possibly contemplate even a little bit of human intervention
in many of the large-scale, high-performance upstream processing steps.
But that's not what the question starting this thread was about, and it's
not what I've been trying to sketch solution approaches for.

> Moreover, as spammers are getting smarter all the time, anti-spammers are
> also becoming more sophisticated -- suppose that somebody built a spider
> track by generating random _sentences_ instead of words: that would be
> very hard to detect...

Can you show me a list of random sentences that can fool any native
speaker into believing it's a valid text?

You have to get away from the high-tech product development paradigm of
"by human hands untouched" to the scruffy, underfunded, underpowered
paradigm in which undergraduate interns eyeball the results of each
night's run to see if anything obviously bogus came through.
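
The eyeballing step needs very little tooling. As a rough sketch (the function name, the sample size of 20 and the 300-character preview are all arbitrary assumptions), something like this would hand the intern a random handful of excerpts from each night's run:

```python
import random
from typing import List, Sequence


def sample_for_review(texts: Sequence[str], k: int = 20, seed: int = 0,
                      preview_chars: int = 300) -> List[str]:
    """Return the first `preview_chars` characters of up to `k` randomly
    chosen texts, for a human to read through."""
    rng = random.Random(seed)  # fixed seed: the same run gives the same sample
    chosen = rng.sample(list(texts), min(k, len(texts)))
    return [text[:preview_chars] for text in chosen]
```

If the excerpts look like word salad, that is the signal to go back and tighten the sampling procedure.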

No, you can't do that when you're updating the Google index or building an
exhaustive named entity ontology. But I'm having more and more difficulty
understanding why we can't just focus in this thread on the much
smaller-scale problem actually at hand: on-the-fly capture of sample texts
for a linguistic research corpus.

>> > Incidentally, a "spider trap" query on google returns many more results
>> > about crawlers, robots.txt files etc. than about how to capture
>> > eight-legged arachnids... one good example of how one should be careful
>> > when using the web as a way to gather knowledge about the world...
>>
>> I believe there's a huge difference between using the web as a way to
>> gather knowledge about the world (especially if this is being done
>> automatically) and using the web as a way to populate a corpus for
>> linguistic research. The latter use is much less ambitious, and simply
>> doesn't need to be weighed down by most of the concerns that web-mining or
>> indexing applications do.


> I agree that, as linguists, even if what we get is not corresponding to
> the "truth" in the outside world, we do not need to worry, but factors
> like the distribution of senses of a word in our corpus should be of our
> concern. For example, if I were to extract the semantics of the word
> "spider" from a corpus, I would rather get the eight-legged-creepy-crawly
> creature reference as the central sense. In web-data, this could be
> tricky (of course, I'm not saying that it would be impossible -- I'm just
> saying that one should be a bit careful about what one can find in
> web-data...)

That goes back to my earlier comments about statistical research design.
You can characterize the distribution of senses of a word in a sample, and
make inferences (which may be justifiable inferences if you're a capable
statistician or have one in your project) about the underlying population
from which your sample was drawn.

You cannot, however, make justifiable inferences about supersets of the
underlying population. (That would be an over-generalization.) One
important trick in selling statistical results is being able to
demonstrate that you know what your population is: that you know its
boundary constraints, and that you haven't over-generalized in your
inferences.

So, with appropriate statistical techniques, you _might_ be able to
characterize the distribution of word senses of "spider" in a sample of
texts captured from the web and then to infer something justifiable about
the distribution of word senses of "spider" in web-served HTML and
plaintext documents (your "underlying population" in the jargon of
statistics).

But if you tried to sell me an inference from that web sample about the
distribution of word senses of "spider" in written English, much less
English full-stop, then I wouldn't be buying: I'd point out the flaw in
your research design. Such an inference would be over-generalized and
almost certainly not justified on the basis of your sample, because your
sample would not have been representative of written English, much less
English full-stop.
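
To make the sample-versus-population distinction concrete, here is a rough sketch of the kind of inference that would be defensible. It computes a Wilson score interval for the proportion of one sense in a sample. The counts in the usage note are invented for illustration, and the calculation assumes the occurrences are a simple random sample of the population you claim to describe, which is exactly the assumption a reviewer would probe.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    hits: occurrences tagged with the sense of interest
    n:    all tagged occurrences of the word in the sample
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need 0 <= hits <= n and n > 0")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

With made-up counts of 120 arachnid readings among 400 hand-tagged occurrences of "spider", `wilson_interval(120, 400)` gives an interval of roughly 26% to 35%. That interval is a statement about web-served documents of the kind you sampled, and about nothing larger. Occurrences that cluster within the same document are not independent, so a real analysis would probably need to sample or model at the document level.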

Take a look at research journals in epidemiology, psychology or sociology
and you'll find that this kind of over-generalization, rebuttal and
subsequent redefinition of the underlying population goes on all the time.
It's a natural part of the way science is generally done when statistical
measures are the only way to fly.

>> Most corpus linguists who are constructing a dataset on the fly are just
>> interested
>
> I am surprised by how you seem to know so much about what corpus linguists
> do and like -- personally, I am not even sure I have understood who
> qualifies as a corpus linguist, yet...

I guess you qualify as a corpus linguist if you spend a not-insignificant
proportion of your time doing corpus linguistics. :)

I've been building computer corpora and the software to acquire, store and
process them off and on since the mid-1970's (you know, back when getting
a grant to purchase Brown or London-Lund on magnetic tape was a Big Deal).
Although it's certainly the case that, if pressed for precision, my idea
of what qualifies as corpus linguistics may differ from that of others
with equal or greater exposure to the field, I guess I'm surprised at the
notion that I wouldn't know corpus linguistics when I see it.

>> and are usually willing to add or change samples indefinitely
>> until their corpus has the characteristics they need.
>
> In my experience, adding and changing samples indefinitely until I have
> about 1 billion words of web-data with the characteristics I need turns
> out to be a pretty difficult thing to do... if you can suggest a
> procedure to do this in an easy way, I (and, I suspect, "most corpus
> linguists") would be very grateful.

By what procedure did you arrive at 1 billion words as your required
sample size? Why not 500 million or 5 billion?

That said, if you do need a corpus that big and you really don't know how
to build one from web data with the characteristics you need, and you're
reasonably confident that the characteristics can be achieved with a web
sample, then there are probably several of us here who could help you. You
could start a new thread, since that's a very different problem domain
from the one we've been addressing here -- one that would certainly profit
from a high-performance off-the-shelf crawler and other components.

-- Mark

Mark P. Line
San Antonio, TX
