[Corpora-List] Query on the use of Google for corpus research

Mark P. Line mark at polymathix.com
Mon May 30 21:43:00 CEST 2005

Dominic Widdows said:

> Mark P. Line said:
>> Actually, I don't think it's really true anymore that large-scale corpus
>> extraction from the Web necessarily puts you at the mercy of commercial
>> search engines. It's no longer very difficult to throw together a software
>> agent that will crawl the Web directly.

> But is it not quite difficult to "throw something together" that
> doesn't cause all sorts of traffic problems? I have always shied away
> from actually trying this, under the impression that it's a bit of a
> dangerous art, but then this is certainly partly due to ignorance.

There's a protocol for robotic web crawlers that you should honor, whereby
websites can specify how they wish such crawlers to behave when their site
is encountered during a crawl. Other than that, I wouldn't worry too much
about traffic caused by your harvesting. Kids build web mining
applications in Java 101 these days. Heck, they're probably doing it in
high school. *shrug*
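
For anyone who wants to see what honoring that protocol looks like, here is a minimal sketch in Python using the standard library's robots.txt parser. The robots.txt text, the agent name "corpus-harvester" and the example.org URLs are all made up for illustration; a real harvester would fetch each site's own robots.txt before requesting anything else from that site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "corpus-harvester"  # hypothetical user-agent name


def load_policy(robots_txt):
    """Parse robots.txt text into a policy object we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy, url):
    """True if the site's rules allow our agent to request this URL."""
    return policy.can_fetch(AGENT, url)


policy = load_policy(ROBOTS_TXT)
for url in ("http://example.org/texts/a.html",
            "http://example.org/private/notes.html"):
    print(url, "allowed" if may_fetch(policy, url) else "disallowed")

# Seconds the site asks crawlers to wait between requests (None if unset).
print("crawl delay:", policy.crawl_delay(AGENT))
```

A harvester would call may_fetch() before every request and sleep for the requested delay between requests to the same host.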

>> (IOW: The indexing part of commercial search engines may be rocket
>> science, but the harvesting part of them is not.)

> That's intriguing, as someone who's worked more in indexing, I'd have
> said precisely the opposite :-)
> Delighted if I'm wrong.

My take is that indexing can usefully be as (linguistically or otherwise)
sophisticated as anybody cares and has the money to make it (once you've
actually captured the text), whereas harvesting tends to gain little from
anything but the most rudimentary filtering.

> Is there good reliable software out there, for those who would still be
> fearful of hacking up a harvester for themselves?

There are lots of web robots out there. Here's a good starting point:


If you do decide you'd like to roll your own, here's a starting point for


>> I think that if you describe your harvesting procedure accurately (what
>> you seeded it with, and what filters you used if any), and monitor and
>> report on a variety of statistical parameters as your corpus is growing,
>> there's no reason why the resulting data wouldn't serve as an adequate
>> sample for many purposes -- assuming that's what you meant by "vouch for
>> them properly".

> Yes, that is part of what I meant. Do we have a good sense of what
> these statistical parameters should be?

As in all cases of statistical sampling, it depends on the inferences you
hope to be able to justify about the underlying population. My usual
advice is that the research be designed in the following order:

(1) assumptions about the population you wish to characterize;

(2) kinds of characterizations you'd like to be able to make and justify
about the population;

(3) statistical techniques that will provide you with those kinds of
characterizations of a population (given a sample, usually);

(4) sampling requirements of those techniques;

(5) sampling procedures that meet those requirements;

(6) a dataset that was collected by those procedures;

(7) statistical characterization of the sample as required by your
inferential techniques;

(8) inferential results about the population, based on your
characterization of the sample.

You'll need a very different kind of sample if you want to say something
about the passivization of closed-class verbs in English than if you want
to say something about the diffusion of neologisms in English
biotechnology jargon.
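
As one illustration of reporting a statistical parameter while a harvested corpus grows, here is a rough Python sketch that records token count, type count and type/token ratio at regular intervals. The choice of type/token ratio, the crude regex tokenizer and the function name growth_profile are my own assumptions for the example; which parameters actually matter depends on steps (1) through (4) above.

```python
import re


def growth_profile(documents, step=2):
    """Return (docs, tokens, types, type/token ratio) every `step` documents.

    `documents` is a list of raw text strings in harvesting order.
    """
    types = set()
    tokens = 0
    profile = []
    for i, text in enumerate(documents, start=1):
        # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        tokens += len(words)
        types.update(words)
        if i % step == 0 or i == len(documents):
            ratio = len(types) / tokens if tokens else 0.0
            profile.append((i, tokens, len(types), ratio))
    return profile


# Toy stand-in for harvested pages.
docs = ["the cat sat", "the dog sat", "a bird flew", "the cat flew"]
for row in growth_profile(docs):
    print(row)
```

Published alongside the seed list and filters, a table like this lets a reader judge whether the parameter had stabilized by the time harvesting stopped.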

> To what extent is there a code of practice for saying exactly what you
> did?

I think that the code of practice should be that of statistics. It's a
well-established practice in most of the other sciences, after all. :)

> Again, we run into
> standard empiricist questions - using your proposal, one could
> guarantee to reproduce the "initial conditions" of someone's
> experiment, but you could at best expect similar outcomes.

Yes. That's very similar to the situation with empirical research in, say,
wetland ecology. Science can progress usefully in either field, even
though nobody would ever expect literally identical outcomes when a study
is replicated.

> This still leaves some of the traditional benefits of corpora
> unaccounted for - what about normalising the text content (presuming
> the traditional notion that text content is the linguistics phenomenon
> you're interested in), tagging, perhaps getting all the data into the
> same character set, etc.?

I don't see how any of that is prevented by harvesting your own set of raw
texts from the Web.

> However, there is still the problem that the more sophisticated stuff
> you throw at your data, the harder it is for anyone to replicate or
> extend your results, and ideally, I would like to see a system where
> the data itself is made available as a standard part of practice.
> Ideally, we would still work on the same datasets if possible, rather
> than duplicating similar datasets for each isolated project.

That might be laudable, were it not for the fact that different kinds of
questions require different kinds of samples. I think the approach of
providing ever more megalomaniacal Global Universal General-Purpose
Standard corpora has taken us about as far as it's going to. :)

> From an engineering point of view, storage isn't really a problem here,
> but bandwidth is - you have to keep the files you've trawled and
> processed on disk somewhere, but you might not be able to foot the bill
> for other researchers hitting your web server every time they fancy
> half-a-billion words of nice corpus data.

Right. That's why many people (especially non-linguists) use statistics,
and express their findings in statistical (as opposed to fictitiously
absolutist) terms. :)

Replication of statistical results REQUIRES the use of a different sample,
to show that the inferences about the population were not an artefact of
the sampling procedures or of the particular sample obtained for the
original study.

So, the goal would be to express your findings in such a way that they can
be replicated (or not!) statistically by anybody who cares to crank your
methods on a fresh sample from the same population.
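
To make that concrete, here is a hedged Python sketch of one simple way to compare an original finding with a replication on a fresh sample: a pooled two-proportion z-test on how often some feature occurs. The test, the 1.96 cutoff (roughly a 5% two-sided level) and the counts are assumptions chosen for illustration, not a recommendation for any particular study.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the difference between two sample proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def replicates(hits_a, n_a, hits_b, n_b, cutoff=1.96):
    """True if the two samples do not differ significantly at the cutoff."""
    return abs(two_proportion_z(hits_a, n_a, hits_b, n_b)) < cutoff


# Made-up counts: feature occurrences per 1000 sampled sentences.
print(replicates(120, 1000, 130, 1000))  # original vs. a similar fresh sample
print(replicates(120, 1000, 200, 1000))  # original vs. a very different one
```

Failing to find a significant difference is not proof that the finding replicated, of course; it just means the fresh sample gives no evidence against it at that sample size.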

-- Mark

Mark P. Line
San Antonio, TX
