[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Leon Derczynski leon.derczynski at sheffield.ac.uk
Tue Feb 3 21:50:55 CET 2015

There is an analysis of this exact issue in a CoNLL paper from last year, which may prove helpful:

"What’s in a p-value in NLP?" - Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy and Héctor Martínez Alonso

All the best,


On 3 February 2015 at 20:55, Detmar Meurers <dm at sfs.uni-tuebingen.de> wrote:

> Hi Andrew and colleagues,
> > This doesn't solve the problem, because your proposed comparator web
> > corpus is not representative either, in the relevant sense: while
> > you have randomly selected at the document level, you will be
> > testing at the level of the linguistic feature. (Typically the
> > word). A random sample of 100K texts of, say 2K words each is not a
> > random sample of 200 million words. It is a highly biased sample of
> > 200 million words (because we know in advance that any given word
> > token is not independent of the word tokens that precede it within
> > its text).
> >
> > Thus, we are back to the position Tony spelt out: we are working
> > with samples that we create with the intention that they will
> > approach (as far as we can manage) an ideal of representativeness,
> > but that we *know* don't actually meet said ideal. That's as true of
> > a random set of web texts as it is of a carefully designed entity
> > like the BNC.
>
> with respect to lexical representativeness, one can experimentally
> ground things in one respect though - and it nicely confirms that size
> isn't everything: Marc Brysbaert and colleagues show in "Assessing the
> usefulness of Google Books’ word frequencies for psycholinguistic
> research on word processing"
> http://journal.frontiersin.org/Journal/10.3389/fpsyg.2011.00027/full
> that the SUBTLEX-US corpus (51 million words) they compiled from TV
> and movie subtitles is more representative than the huge Google corpus
> (131 billion words) in terms of explaining lexical decision times. The
> difference is made very concrete: "the Google American English
> frequencies explain 11% less of the variance in the lexical decision
> times from the English Lexicon Project".
>
> In other words: the language in SUBTLEX-US is more representative of
> the typical language experience of undergraduate students (as the
> usual participants in word recognition experiments). So some aspects
> of corpus representativeness can be made very concrete by grounding
> them cognitively.
>
> All the best,
> Detmar
> --
> Prof. Dr. Detmar Meurers, Universität Tübingen http://purl.org/dm
> Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany
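
The quoted point about document-level sampling (a random sample of texts is not a random sample of word tokens) can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the thread: the function name, the Beta-distributed per-document rate, and all parameter values are made up for illustration, and no real corpus is involved. It estimates the relative frequency of one word many times, once from whole documents and once from independent tokens, and compares how much the two estimates vary.

```python
import random
import statistics


def compare_sampling_variance(n_docs=50, doc_len=200, n_reps=200,
                              mean_rate=0.01, burstiness=0.5, seed=0):
    """Return (clustered_var, independent_var) for a word's estimated relative frequency."""
    rng = random.Random(seed)
    n_tokens = n_docs * doc_len
    # Beta(a, b) has mean a / (a + b) == mean_rate; a small `a` gives bursty usage
    # (most documents hardly use the word, a few use it a lot). Hypothetical choice.
    a = burstiness
    b = burstiness * (1 - mean_rate) / mean_rate
    clustered, independent = [], []
    for _ in range(n_reps):
        # Document-level sampling: each document has its own rate for the word,
        # so tokens within one document are not independent of each other.
        hits = 0
        for _ in range(n_docs):
            doc_rate = rng.betavariate(a, b)
            hits += sum(rng.random() < doc_rate for _ in range(doc_len))
        clustered.append(hits / n_tokens)
        # Token-level sampling: every token is an independent draw at the mean rate.
        hits = sum(rng.random() < mean_rate for _ in range(n_tokens))
        independent.append(hits / n_tokens)
    return statistics.variance(clustered), statistics.variance(independent)


if __name__ == "__main__":
    clustered_var, independent_var = compare_sampling_variance()
    print("variance, document-level sample:", clustered_var)
    print("variance, token-level sample:   ", independent_var)
    print("ratio (rough design effect):    ", clustered_var / independent_var)
```

With the same number of tokens in both cases, the document-level estimate should vary noticeably more than the token-level one. A significance test that treats tokens as independent therefore understates the uncertainty of the estimate.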

--
Leon R A Derczynski
Research Associate, NLP Group

Department of Computer Science
University of Sheffield, UK

Voted number one for student experience Times Higher Education Student Experience Survey 2014-2015

http://www.dcs.shef.ac.uk/~leon/
