[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Hardie, Andrew a.hardie at lancaster.ac.uk
Tue Feb 3 19:34:46 CET 2015

This doesn't solve the problem, because your proposed comparator web corpus is not representative either, in the relevant sense: while you have randomly selected at the document level, you will be testing at the level of the linguistic feature (typically the word). A random sample of 100K texts of, say, 2K words each is not a random sample of 200 million words. It is a highly biased sample of 200 million words, because we know in advance that any given word token is not independent of the word tokens that precede it within its text.
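
To make the point concrete, here is a toy simulation in Python. It is a sketch only: every number and function name in it is invented for illustration, and nothing is drawn from a real corpus. It assumes a word that occurs only in "topical" texts (10% of texts, at a rate of 5% of tokens), and it treats tokens as independent once the text is fixed, which is far cruder than real language. It then estimates the word's relative frequency twice, once from randomly sampled texts and once from the same number of independently sampled tokens.

```python
import random
import statistics


def count_hits(n_tokens, rate, rng):
    """Count target-word tokens among n_tokens independent draws at the given rate."""
    return sum(1 for _ in range(n_tokens) if rng.random() < rate)


def document_sample_freq(n_docs, doc_len, topical_share, topical_rate, rng):
    """Relative frequency estimated from randomly sampled whole texts.

    Assumption (hypothetical): a text is 'topical' with probability
    topical_share; only topical texts contain the word at all.
    """
    hits = 0
    for _ in range(n_docs):
        rate = topical_rate if rng.random() < topical_share else 0.0
        hits += count_hits(doc_len, rate, rng)
    return hits / (n_docs * doc_len)


def token_sample_freq(n_tokens, overall_rate, rng):
    """Relative frequency estimated from independently sampled tokens."""
    return count_hits(n_tokens, overall_rate, rng) / n_tokens


def compare(n_docs=40, doc_len=100, topical_share=0.1, topical_rate=0.05,
            trials=200, seed=0):
    """Repeat both sampling schemes and summarise the spread of the estimates."""
    rng = random.Random(seed)
    overall_rate = topical_share * topical_rate  # same expected frequency in both schemes
    n_tokens = n_docs * doc_len
    doc_freqs = [
        document_sample_freq(n_docs, doc_len, topical_share, topical_rate, rng)
        for _ in range(trials)
    ]
    token_freqs = [token_sample_freq(n_tokens, overall_rate, rng) for _ in range(trials)]
    return {
        "overall_rate": overall_rate,
        "doc_mean": statistics.mean(doc_freqs),
        "token_mean": statistics.mean(token_freqs),
        "doc_variance": statistics.variance(doc_freqs),
        "token_variance": statistics.variance(token_freqs),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.3g}")
```

Under these assumptions both schemes aim at the same underlying frequency, but the estimates from sampled texts scatter much more widely from one sample to the next. A test that takes the total word count as its number of independent observations will therefore overstate how much evidence a sample of texts provides.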

Thus, we are back to the position Tony spelt out: we are working with samples that we create with the intention that they will approach (as far as we can manage) an ideal of representativeness, but that we *know* don't actually meet said ideal. That's as true of a random set of web texts as it is of a carefully designed entity like the BNC.

PS: see Adam Kilgarriff's paper "Language is never, ever, ever, random": http://kilgarriff.co.uk/Publications/2005-K-lineer.pdf

From: Maximilian Haeussler
To: Jim Fidelholtz
Cc: corpora at uib.no; Mcenery, Tony
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Hi, sorry, but I have a stupid question on this thread, from someone who knows very little about corpus statistics: couldn't people, in addition to running their analyses on their favorite corpus, also run them on a random sample of a few thousand/100k web pages, e.g. from the Common Crawl data, and compare the results? It doesn't solve the p-value issue, but maybe it could give some idea of how representative the corpus is relative to the language used on the web?
