[Corpora-List] What sampling size?

Daniel Elmiger Daniel.Elmiger at unige.ch
Sun Aug 9 21:24:02 CEST 2015


In large corpora, it is very often impossible to analyse every single occurrence of a given phenomenon. One therefore often needs to reduce the amount of data via (random) sampling in order to take a more qualitative look at large quantities of data.
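As an illustration of the kind of random sampling meant here, the following is a minimal sketch only. It assumes the occurrences have already been exported as a plain Python list; the names `sample_hits` and `hits` and the fixed seed are hypothetical choices for the example, not part of any corpus tool.

```python
import random

def sample_hits(hits, k=200, seed=42):
    """Return k hits drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    if k >= len(hits):
        return list(hits)  # fewer hits than requested: examine all of them
    return rng.sample(hits, k)

# Hypothetical stand-in for concordance lines exported from a corpus tool.
hits = [f"hit {i}" for i in range(10_000)]
sample = sample_hits(hits, k=200)
```

Drawing without replacement means that no occurrence is examined twice, and recording the seed lets others reproduce the same sample.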

I’ve several times seen samples of 200 occurrences (examples, tokens) being chosen, each of which is then examined individually. An early example of this approach is Jennifer Coates’ study “The Semantics of the Modal Auxiliaries” (1983).

Does anybody know whether this kind of sampling has inherent advantages (besides reducing the amount of work)? Are there statistical reasons for choosing 200 tokens? (Why not 100 or 500?)
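To make the comparison between 100, 200 and 500 concrete, one possible yardstick is the margin of error of an estimated proportion. The sketch below assumes simple random sampling, a binary classification of each token (feature present or absent), the normal approximation at the 95% level, and the worst case p = 0.5. The function name `margin_of_error` is introduced for illustration only, and this is not a claim about why 200 was chosen in earlier studies.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Compare the three sample sizes mentioned above (result in percentage points).
for n in (100, 200, 500):
    print(n, round(100 * margin_of_error(n), 1))
```

Under these assumptions the margin shrinks from roughly 9.8 percentage points at 100 tokens to about 6.9 at 200 and 4.4 at 500. Since it falls only with the square root of the sample size, each further gain in precision costs disproportionately more manual work.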

I’d be grateful for references on this (or any other practical) kind of sampling. Thank you in advance!

Regards, Daniel Elmiger

-- University of Geneva
