Thus, we are back to the position Tony spelt out: we are working with samples that we create with the intention that they will approach (as far as we can manage) an ideal of representativeness, but that we *know* don't actually meet said ideal. That's as true of a random set of web texts as it is of a carefully designed entity like the BNC.
PS: see Adam Kilgarriff's paper "Language is never, ever, ever, random": http://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Maximilian Haeussler
Sent: 03 February 2015 17:32
To: Jim Fidelholtz
Cc: corpora at uib.no; Mcenery, Tony
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS
Hi, sorry, but I have a stupid question on this thread, from someone who knows very little about corpus statistics: couldn't people, in addition to running their analysis on their favorite corpus, also run it on a random sample of a few thousand or 100k web pages, e.g. from the Common Crawl data, and compare the results? It doesn't solve the p-value issue, but maybe it could give some idea of how representative the corpus is relative to the language used on the web?
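To make the suggested comparison concrete, here is a minimal sketch in Python. It is only one possible reading of "compare the results": it ranks words by Dunning's log-likelihood score (G2), a keyness measure commonly used in corpus linguistics, which nobody in this thread has proposed. The function names, the whitespace tokenizer, and the inputs (two lists of plain-text documents, one standing in for the favorite corpus and one for the web sample) are all assumptions for illustration. The score is used purely to rank words, not as a significance test, so it does not address the p-value issue either.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately crude (assumption): lowercase and split on whitespace.
    return text.lower().split()


def log_likelihood(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G2) for one word across two corpora."""
    # Relative frequency of the word if both corpora are pooled.
    combined = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined
    expected_b = total_b * combined
    g2 = 0.0
    # A zero observed count contributes nothing to the sum.
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    return 2.0 * g2


def compare_corpora(texts_a, texts_b, top_n=10):
    """Return the top_n words whose frequencies differ most between corpora."""
    freq_a = Counter(tok for t in texts_a for tok in tokenize(t))
    freq_b = Counter(tok for t in texts_b for tok in tokenize(t))
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    scores = {
        w: log_likelihood(freq_a[w], total_a, freq_b[w], total_b)
        for w in set(freq_a) | set(freq_b)
    }
    # Highest scores first: these words separate the two samples most.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Calling `compare_corpora(corpus_texts, web_sample_texts)` gives a ranked list of the words on which the two samples diverge most. A long list of high scores would suggest the corpus and the web sample differ noticeably, though, as the reply above points out, neither sample is itself truly representative.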