[Corpora-List] FW: What sampling size?

Amir Zeldes Amir.Zeldes at georgetown.edu
Tue Aug 11 17:16:06 CEST 2015

Hi Daniel,

Ramesh and Angus have already made excellent points about estimating the stability of the distribution of phenomena within your sample, so I won't say anything about that. But I wanted to add one thing about errors in your search and estimating the error rate.

Especially in a scenario where you run multiple queries that are each meant to give you a count of some variant, you may want to be able to use the entire result set (many more hits than you can read). If you are working on some alternation or a phenomenon that has multiple alternative, known realizations, you may be able to say something using the entire dataset, provided you have good reason to believe that each query variant is highly accurate.

I think in these kinds of cases, a limited, manually analyzed random subset, used just to estimate the error rate for each query, can be very valuable. I also agree that there's nothing special about a number like 200; I've used this strategy with 1000 hits per construction and published results using entire datasets when the error rate was below 1% (that is, fewer than 10 spurious hits per 1000 randomly sampled items, after which I reported the fully automatic data for each variant). In any case, the most important thing is to state openly what you are doing - if it doesn't make sense, peer reviewers will let you know :)
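
As a rough illustration of the error-rate check described above, here is a minimal Python sketch. It assumes the manually checked hits are a simple random sample of the query results, and it uses a Wilson score interval, which is one common choice and not necessarily the method used in the published work. The function name and the figure of 7 spurious hits are hypothetical.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    errors: number of spurious hits found in the manually checked sample
    n:      size of the manually checked sample
    z:      normal quantile (1.96 gives roughly a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical figures: 7 spurious hits among 1000 manually checked hits
low, high = wilson_interval(7, 1000)
print(f"observed error rate: {7 / 1000:.1%}")
print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

Note that an observed rate below 1% does not by itself guarantee that the upper end of the interval is below 1%, so it may be worth reporting both figures for each query variant.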




Dr. Amir Zeldes
Asst. Prof. for Computational Linguistics
Department of Linguistics
Georgetown University
1437 37th St. NW
Washington, DC 20057


From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Krishnamurthy, Ramesh
Sent: Monday, August 10, 2015 06:43
To: corpora at uib.no
Subject: [Corpora-List] FW: What sampling size?

hi Daniel

#1 angus raises the legitimate problem of the skewing of randomly selected examples through disproportionate occurrences of a language feature in one source text or a subset of source texts... however you can monitor such skewing in any corpus software that allows the display of source text ids in the concordance display screen... and therefore adjust for it in your analysis...
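
The monitoring of skew by source text id mentioned in #1 could be sketched roughly as follows in Python. This assumes the concordance hits have been exported as (source text id, concordance line) pairs; the export format, the function name and the sample data are all hypothetical.

```python
from collections import Counter

def source_shares(hits):
    """Return (source_id, count, share) tuples, largest count first.

    hits: iterable of (source_id, concordance_line) pairs
    """
    counts = Counter(source_id for source_id, _ in hits)
    total = sum(counts.values())
    return [(src, c, c / total) for src, c in counts.most_common()]

# Hypothetical sample of six concordance lines from three source texts
sample = [
    ("text_A", "... line 1 ..."),
    ("text_A", "... line 2 ..."),
    ("text_B", "... line 3 ..."),
    ("text_A", "... line 4 ..."),
    ("text_C", "... line 5 ..."),
    ("text_A", "... line 6 ..."),
]

for src, count, share in source_shares(sample):
    print(f"{src}: {count} hits ({share:.0%} of sample)")
```

A source text that contributes a much larger share of the sample than of the corpus as a whole is a candidate for the kind of adjustment described above.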

#2 there is also the consideration of how accurate you want your analysis/statements to be... working with much smaller corpora, i think Sinclair suggested that our ultimate aim should be to account for every single example... he and others at that time (eg Stubbs, Tognini-Bonelli?) offered an alternative technique: take one concordance screenful, and note the numbers of whichever features you are interested in; take a second screenful and do the same; in general, the proportion of those features will stabilise after a few screenfuls... but more details will appear, indicating sub-features in relation to those features... at some point, when you are satisfied with the depth of analysis of the original features (and here the corpus frequency of the nodeword will be relevant) you can estimate the percentage of the total occurrences that you have analysed, and decide whether the rate of change has stabilised sufficiently for your purposes...?
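
The screenful-by-screenful technique in #2 can be sketched roughly in Python as below. This is only one possible reading: each screenful is assumed to be a list of manual yes/no judgements for a single feature, and "stabilised" is taken to mean that the cumulative proportion changed by no more than a chosen tolerance over the last few screenfuls. The function names, tolerance, window and data are all hypothetical.

```python
def cumulative_proportions(screenfuls):
    """Cumulative proportion of feature-bearing lines after each screenful.

    screenfuls: list of lists of booleans (True = feature present in that line)
    """
    seen = 0
    hits = 0
    proportions = []
    for screen in screenfuls:
        seen += len(screen)
        hits += sum(screen)
        if seen > 0:  # skip leading empty screenfuls to avoid dividing by zero
            proportions.append(hits / seen)
    return proportions

def stabilised(proportions, tolerance=0.02, window=2):
    """True if each of the last `window` changes is within `tolerance`."""
    if len(proportions) <= window:
        return False
    recent = proportions[-(window + 1):]
    return all(abs(b - a) <= tolerance for a, b in zip(recent, recent[1:]))

# Hypothetical judgements for four screenfuls of ten lines each
screens = [
    [True] * 6 + [False] * 4,
    [True] * 4 + [False] * 6,
    [True] * 5 + [False] * 5,
    [True] * 5 + [False] * 5,
]
props = cumulative_proportions(screens)
print(props)
print("stabilised:", stabilised(props))
```

The tolerance and window are judgement calls that depend on how accurate the analysis needs to be, which is the point made at the start of #2.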

combining these two strategies (statistical/probability calculations as per angus... plus manual inspection/annotation) may provide the triangulation you need?

best ramesh


Date: Sun, 9 Aug 2015 15:54:02 -0400
From: Angus Grieve-Smith <grvsmth at panix.com>
Subject: Re: [Corpora-List] What sampling size?
To: corpora at uib.no

This is an excellent question, and the answer is seldom "200." It depends on several factors.

The first is that corpora are almost always a sample of some kind. When you generalize from your corpus, what are you generalizing to?

The second is that corpora almost always have some internal structure of their own. If you simply grab 200 occurrences at random, are you oversampling one or more texts, speakers/authors, subgenres? The answer to that depends on your hypothesis, and the theory that it is embedded in.

There is a large literature on sampling in the social sciences. All that stuff about p-values and chi-squares is basically aimed at answering your question. It goes back to Laplace's question, "Can we get a good estimate of the population of the French empire without counting everyone?" and Student's question, "How many batches of Guinness do I have to examine to properly evaluate this strain of barley?"
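
To give a rough sense of what that literature says about 100 versus 200 versus 500 tokens, here is a minimal Python sketch of the standard normal-approximation margin of error for an estimated proportion. It assumes a simple random sample of independent hits, which, given the internal structure of corpora discussed above, is often not a safe assumption; the function name is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n items.

    p=0.5 is the worst case (widest margin); z=1.96 is the normal quantile.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 200, 500):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

The margin shrinks with the square root of the sample size, so there is no threshold at 200; the right size depends on how precise the estimate needs to be.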

I've written more about sampling on my blog:


but in general I encourage you to consult a statistician with experience in social science sampling. From a glance at your university's website, it looks like you have some good people.


On 8/9/2015 3:24 PM, Daniel Elmiger wrote:
> Hello,
> In large corpora, it is very often impossible to analyse every single occurrence of a given phenomenon: Therefore, one often needs to reduce the amount of data via (random) sampling in order to have a more qualitative look at large quantities of data.
> I've seen several times that samples of 200 occurrences/examples/tokens are chosen, each of which is then individually examined. An early example of this approach is Jennifer Coates' study about "The Semantics of the Modal Auxiliaries" (1983).
> Does anybody know if this kind of sampling has inherent advantages (besides the fact that it reduces the quantity of work)? Are there statistical reasons to take into account 200 tokens? (Why not 100 or 500?)
> I'd be grateful for documentation about this (or any other kind of practical) sampling. Thank you in advance!
> Regards,
> Daniel Elmiger
> --
> University of Geneva


-Angus B. Grieve-Smith

grvsmth at panix.com
