[Corpora-List] FW: What sampling size?

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Mon Aug 10 12:43:04 CEST 2015

hi Daniel

#1 angus raises the legitimate problem of the skewing of randomly selected examples through disproportionate occurrences of a language feature in one source text or a subset of source texts... however you can monitor such skewing in any corpus software that allows the display of source text ids in the concordance display screen... and therefore adjust for it in your analysis...

#2 there is also the consideration of how accurate you want your analysis/statements to be... working with much smaller corpora, i think Sinclair suggested that our ultimate aim should be to account for every single example... he and others at that time (eg Stubbs, Tognini-Bonelli?) offered an alternative technique: take one concordance screenful, and count whichever features you are interested in; take a second screenful and do the same; in general, the proportions of those features will stabilise after a few screenfuls... but more details will keep appearing, indicating sub-features of those features... at some point, when you are satisfied with the depth of analysis of the original features (and here the corpus frequency of the nodeword will be relevant), you can estimate what percentage of the total occurrences you have analysed, and decide whether the rate of change has stabilised sufficiently for your purposes...
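
The screenful-by-screenful procedure in #2 could be sketched roughly as follows. This is only an illustration, not a method taken from Sinclair or the others mentioned: the function names, the screen size of 20 lines, the 0.02 tolerance and the window of 3 screenfuls are all assumptions chosen for the example, and `labels` stands for your own manual yes/no annotation of each concordance line, in the order inspected.

```python
from typing import List, Optional, Sequence


def running_proportions(labels: Sequence[bool], screen_size: int = 20) -> List[float]:
    """Cumulative proportion of lines showing the feature after each full screenful.

    `labels` holds one manual judgement per concordance line (True = feature present).
    `screen_size` is an assumed number of lines per concordance screen.
    """
    proportions = []
    hits = 0
    for end in range(screen_size, len(labels) + 1, screen_size):
        hits += sum(labels[end - screen_size:end])  # hits in the newest screenful
        proportions.append(hits / end)              # proportion over all lines so far
    return proportions


def screens_until_stable(
    proportions: Sequence[float], tolerance: float = 0.02, window: int = 3
) -> Optional[int]:
    """1-based number of screenfuls after which the cumulative proportion has moved
    by less than `tolerance` for `window` consecutive screenfuls, or None if it
    never settles. Both thresholds are arbitrary illustrative choices.
    """
    run = 0
    for i in range(1, len(proportions)):
        if abs(proportions[i] - proportions[i - 1]) < tolerance:
            run += 1
            if run >= window:
                return i + 1
        else:
            run = 0
    return None


def coverage(lines_analysed: int, corpus_frequency: int) -> float:
    """Share of all occurrences of the nodeword that have been inspected."""
    return lines_analysed / corpus_frequency
```

A stable cumulative proportion only says that the headline figure has stopped moving; it does not say that the sub-features which keep appearing have been exhausted, so the stopping point remains a judgement call.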

combining these two strategies (statistical/probability calculations as per angus... plus manual inspection/annotation) may provide the triangulation you need?

best ramesh

________________________________
Date: Sun, 9 Aug 2015 15:54:02 -0400
From: Angus Grieve-Smith <grvsmth at panix.com>
Subject: Re: [Corpora-List] What sampling size?
To: corpora at uib.no

This is an excellent question, and the answer is seldom "200." It depends on several factors.

The first is that corpora are almost always a sample of some kind. When you generalize from your corpus, what are you generalizing to?

The second is that corpora almost always have some internal structure of their own. If you simply grab 200 occurrences at random, are you oversampling one or more texts, speakers/authors, subgenres? The answer to that depends on your hypothesis, and the theory that it is embedded in.
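
One way to check a random sample for this kind of oversampling is to tally the source text ids attached to the sampled concordance lines. The sketch below assumes you can export one source id per sampled line from your concordancer; the function names and the 10% threshold are hypothetical, picked only for illustration, and what counts as "too much from one source" depends on your hypothesis.

```python
from collections import Counter
from typing import Dict, List, Sequence


def source_shares(source_ids: Sequence[str]) -> Dict[str, float]:
    """Share of the sample contributed by each source, largest first."""
    counts = Counter(source_ids)
    total = len(source_ids)
    return {src: n / total for src, n in counts.most_common()}


def dominant_sources(source_ids: Sequence[str], max_share: float = 0.10) -> List[str]:
    """Sources that contribute more than `max_share` of the sampled lines.

    `max_share` is an arbitrary illustrative cut-off, not a recommended value.
    """
    return [src for src, share in source_shares(source_ids).items() if share > max_share]
```

For example, `dominant_sources(["A"] * 6 + ["B"] * 2 + ["C"] * 2, max_share=0.25)` flags only source `"A"`, which supplies 60% of that sample.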

There is a large literature on sampling in the social sciences. All that stuff about p-values and chi-squares is basically aimed at answering your question. It goes back to Laplace's question, "Can we get a good estimate of the population of the French empire without counting everyone?" and Student's question, "How many batches of Guinness do I have to examine to properly evaluate this strain of barley?"
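
As a rough illustration of what that literature offers for the "why 200, not 100 or 500?" question, the sketch below computes the textbook margin of error for an estimated proportion. It assumes a simple random sample of independent occurrences and the normal approximation at a 95% level (z = 1.96); the internal structure of a corpus means the independence assumption often fails, so treat the figures as optimistic. The function names are made up for this example.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion.

    p = 0.5 is the worst case (widest interval); z = 1.96 corresponds to 95%.
    """
    return z * math.sqrt(p * (1 - p) / n)


def sample_size_for(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose margin of error is at most `margin`, under the same assumptions."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    for n in (100, 200, 500):
        print(f"n = {n}: +/- {margin_of_error(n):.3f}")
```

Under these assumptions, 200 tokens give a worst-case margin of about 7 percentage points, 100 tokens about 10, and 500 about 4.4. So 200 is not a special threshold: it is one point on a curve where precision improves only with the square root of the sample size.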

I've written more about sampling on my blog:


but in general I encourage you to consult a statistician with experience in social science sampling. From a glance at your university's website, it looks like you have some good people.


On 8/9/2015 3:24 PM, Daniel Elmiger wrote:
> Hello,
> In large corpora, it is very often impossible to analyse every single occurrence of a given phenomenon: Therefore, one often needs to reduce the amount of data via (random) sampling in order to have a more qualitative look at large quantities of data.
> I've seen several times that samples of 200 occurrences/examples/tokens are chosen, each of which is then individually examined. An early example of this approach is Jennifer Coates' study about "The Semantics of the Modal Auxiliaries" (1983).
> Does anybody know if this kind of sampling has inherent advantages (besides the fact that it reduces the quantity of work)? Are there statistical reasons to take into account 200 tokens? (Why not 100 or 500?)
> I'd be grateful for documentation about this (or any other kind of practical) sampling. Thank you in advance!
> Regards,
> Daniel Elmiger
> --
> University of Geneva


-Angus B. Grieve-Smith

grvsmth at panix.com