[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Maximilian Haeussler max at soe.ucsc.edu
Tue Feb 3 18:32:03 CET 2015

Hi, sorry, but I have a stupid question on this thread, from someone who knows very little about corpus statistics: couldn't people, in addition to running their analysis on their favorite corpus, also run it on a random sample of web pages (anywhere from a few thousand to 100k, e.g. from the Common Crawl data) and compare the results? It doesn't solve the p-value issue, but maybe it could give some idea of how representative the corpus is relative to the language used on the web?
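
A minimal sketch of what such a comparison could look like, assuming both the corpus and the web sample are already available as plain text. The Jensen-Shannon divergence over word frequencies is just one possible measure and an assumption here, not something proposed in the thread; the function names and the toy strings are hypothetical.

```python
import math
import re
from collections import Counter


def relative_freqs(text):
    """Lowercase, split into word tokens, return word -> relative frequency."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency dicts.

    0.0 means identical distributions; 1.0 means no shared vocabulary.
    """
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = (pw + qw) / 2  # mixture of the two distributions
        if pw:
            div += 0.5 * pw * math.log2(pw / m)
        if qw:
            div += 0.5 * qw * math.log2(qw / m)
    return div


# Toy stand-ins (assumption): real input would be the text of the corpus
# and the text extracted from the randomly sampled web pages.
own_corpus = "the court ruled that the appeal was dismissed"
web_sample = "the best deals on the web click here for the offer"

print(round(js_divergence(relative_freqs(own_corpus), relative_freqs(web_sample)), 3))
```

A lower value would suggest the corpus vocabulary is distributed more like the web sample. It says nothing about significance, so the p-value issue stays open.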

cheers Max

