[Corpora-List] What sampling size?

Angus Grieve-Smith grvsmth at panix.com
Sun Aug 9 21:54:02 CEST 2015

This is an excellent question, and the answer is seldom "200." It depends on several factors.

The first is that corpora are almost always a sample of some kind. When you generalize from your corpus, what are you generalizing to?

The second is that corpora almost always have some internal structure of their own. If you simply grab 200 occurrences at random, are you oversampling one or more texts, speakers/authors, subgenres? The answer to that depends on your hypothesis, and the theory that it is embedded in.
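To make the oversampling concern concrete, here is a minimal sketch of one possible safeguard: capping how many occurrences any single text may contribute to the sample. The function name `sample_with_cap`, the `(text_id, token)` data layout, and the toy corpus are all assumptions made for illustration; whether a cap, proportional allocation, or something else is appropriate depends on your hypothesis.

```python
import random
from collections import Counter, defaultdict

def sample_with_cap(occurrences, n, max_per_text, seed=0):
    """Draw up to n occurrences at random, taking at most max_per_text from any one text.

    `occurrences` is a list of (text_id, token) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    shuffled = list(occurrences)
    rng.shuffle(shuffled)          # random order, so the cap keeps a random subset per text
    per_text = defaultdict(int)
    chosen = []
    for text_id, token in shuffled:
        if per_text[text_id] < max_per_text:
            chosen.append((text_id, token))
            per_text[text_id] += 1
            if len(chosen) == n:
                break
    return chosen

# Toy corpus: one long text dominates the raw occurrence counts.
occurrences = (
    [("text_A", i) for i in range(900)]
    + [("text_B", i) for i in range(50)]
    + [("text_C", i) for i in range(50)]
)

sample = sample_with_cap(occurrences, n=100, max_per_text=40)
counts = Counter(text_id for text_id, _ in sample)
print(counts)  # no text contributes more than 40 occurrences
```

A plain `random.sample(occurrences, 100)` on the same toy data would be expected to draw about 90 of its 100 items from `text_A`, which is the kind of imbalance described above.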

There is a large literature on sampling in the social sciences. All that stuff about p-values and chi-square tests is basically aimed at answering your question. It goes back to Laplace's question, "Can we get a good estimate of the population of the French empire without counting everyone?" and Student's question, "How many batches of Guinness do I have to examine to properly evaluate this strain of barley?"
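As a rough illustration of how sample size enters into this, the sketch below computes the approximate 95% margin of error for an estimated proportion under simple random sampling, using the standard normal approximation. The function name `margin_of_error` and the choice of p = 0.5 (the worst case) are assumptions made for the example; it ignores the internal structure of the corpus, so real designs may need wider margins.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n random tokens.

    Uses the normal approximation z * sqrt(p * (1 - p) / n); p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 200, 500):
    print(n, round(margin_of_error(n), 3))
# 100 0.098
# 200 0.069
# 500 0.044
```

Under these assumptions, 200 tokens give roughly plus or minus 7 percentage points, 100 give about 10, and 500 about 4, so the figure of 200 is a trade-off between effort and precision rather than a threshold with special statistical status.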

I've written more about sampling on my blog:


but in general I encourage you to consult a statistician with experience in social science sampling. From a glance at your university's website, it looks like you have some good people.


On 8/9/2015 3:24 PM, Daniel Elmiger wrote:
> Hello,
> In large corpora, it is very often impossible to analyse every single occurrence of a given phenomenon: Therefore, one often needs to reduce the amount of data via (random) sampling in order to have a more qualitative look at large quantities of data.
> I’ve seen several times that samples of 200 occurrences/examples/tokens are chosen, each of which is then individually examined. An early example of this approach is Jennifer Coates’ study about „The Semantics of the Modal Auxiliaries“ (1983).
> Does anybody know if this kind of sampling has inherent advantages (besides the fact that it reduces the quantity of work)? Are there statistical reasons to take into account 200 tokens? (Why not 100 or 500?)
> I’d be grateful for documentation about this (or any other kind of practical) sampling. Thank you in advance!
> Regards,
> Daniel Elmiger
> --
> University of Geneva


-Angus B. Grieve-Smith

grvsmth at panix.com
