Hi Andrew and colleagues,

> This doesn't solve the problem, because your proposed comparator web
> corpus is not representative either, in the relevant sense: while
> you have randomly selected at the document level, you will be
> testing at the level of the linguistic feature. (Typically the
> word). A random sample of 100K texts of, say 2K words each is not a
> random sample of 200 million words. It is a highly biased sample of
> 200 million words (because we know in advance that any given word
> token is not independent of the word tokens that precede it within
> its text).
> Thus, we are back to the position Tony spelt out: we are working
> with samples that we create with the intention that they will
> approach (as far as we can manage) an ideal of representativeness,
> but that we *know* don't actually meet said ideal. That's as true of
> a random set of web texts as it is of a carefully designed entity
> like the BNC.

With respect to lexical representativeness, one can ground things experimentally in at least one respect, and the result nicely confirms that size isn't everything. Marc Brysbaert and colleagues show in "Assessing the usefulness of Google Books’ word frequencies for psycholinguistic research on word processing" (http://journal.frontiersin.org/Journal/10.3389/fpsyg.2011.00027/full) that the SUBTLEX-US corpus (51 million words), which they compiled from TV and movie subtitles, is more representative than the huge Google corpus (131 billion words) in terms of explaining lexical decision times. The difference is made very concrete: "the Google American English frequencies explain 11% less of the variance in the lexical decision times from the English Lexicon Project".

In other words, the language in SUBTLEX-US is more representative of the typical language experience of undergraduate students, who are the usual participants in word recognition experiments. So some aspects of corpus representativeness can be made very concrete by grounding them cognitively.
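
For anyone who wants to see what "explains less of the variance" means operationally, here is a minimal sketch in Python. It uses purely synthetic data: none of the numbers come from the paper, and the names (exposure, log_freq_a, log_freq_b, rt) are hypothetical. It assumes a simple linear regression of reaction time on log frequency, which may well differ from the models used in the study. The only point is that a frequency norm tracking participants' actual language experience more closely yields a higher R^2.

```python
import random


def r_squared(x, y):
    """Share of the variance in y explained by a simple linear regression on x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


random.seed(0)
n_words = 2000

# Hypothetical "true" exposure of participants to each word (log scale).
exposure = [random.gauss(0, 1) for _ in range(n_words)]

# Two synthetic frequency norms: A tracks exposure closely, B only loosely.
log_freq_a = [e + random.gauss(0, 0.5) for e in exposure]
log_freq_b = [e + random.gauss(0, 1.5) for e in exposure]

# Synthetic lexical decision times (ms): familiar words are answered faster.
rt = [650 - 40 * e + random.gauss(0, 30) for e in exposure]

r2_a = r_squared(log_freq_a, rt)
r2_b = r_squared(log_freq_b, rt)

print(f"R^2, norm A: {r2_a:.3f}")
print(f"R^2, norm B: {r2_b:.3f}")
print(f"Difference:  {r2_a - r2_b:.3f}")
```

With real data one would replace the synthetic lists with log frequencies from the two corpora and the observed decision times for the same set of words.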

All the best, Detmar

