[Corpora-List] An ignorant question concerning the basics of statistical significance

Marko, Georg (georg.marko@uni-graz.at) georg.marko at uni-graz.at
Mon Feb 2 23:32:37 CET 2015

Thanks a lot to all of you who responded. That was very enlightening, helpful and also very entertaining (I now constantly think of earthquakes in Vienna). I will also look at and consider the sources suggested.

But can I boil it down to a single question? If I am a discourse analyst working with corpora who loves counting linguistic elements and structures and drawing all kinds of fancy conclusions from these, what will I have to do if a journal editor asks me to apply a statistical significance test to these numbers? (This is not the real situation; my problem is more general or long-term, so I’m not looking for a quick fix.) For a long time, I simply rejected the idea, thinking that the sheer numerical difference should speak for itself, probably because I was afraid it was too complicated. I had some experience with statistics, though I never fully committed: I read the introductory textbooks on statistics and language studies by Woods, Fletcher and Hughes and by Michael Oakes, but never applied or practised the methods, and I got the feeling that it was simply not for me. But then I stumbled across an example in McEnery/Xiao/Tono (2006), where they discuss "fucker" in the spoken and written parts of the BNC. They say that the difference is statistically significant and even give the chi-square score. So I thought that, leaving aside the more advanced aspects and the problems, you can compare two values derived from corpora for statistical significance. Unfortunately, their textbook does not describe how they arrived at the value, what they actually did, or which figures were included. Or it is so plainly obvious to the authors and to all readers but me.

I think I understand the principles of such tests for psychology or demographics, etc. where I have a large number of different values. But I have not been sure how this translates to corpus data because it is not clear to me where I include the corpus sizes.

OK, back to the issue and an example, and I take a real one: modal verbs in two corpora. C1 contains 62,000 modal verbs and C2 contains 20,000; the sizes of the corpora are C1 = 3.4 million words and C2 = 1.4 million words. Suppose I decide that, as the numbers are relatively high, the most straightforward way is still to use chi-square, and to spare me the trouble of calculating it myself, I use one of the chi-square calculators available online. As a 2x2 table is the smallest one available, I will have to take it. Now I have four boxes. Obviously, the first two will take 62,000 and 20,000. In my initial question I suggested that the other two boxes should take the number of words in the corpora, because the calculation must take these into account somehow. Yannick’s and Zoltan’s responses made me think, even though they did not say so explicitly, that this was probably a wrong idea. The number of words in a corpus is, to a certain extent, the sample size, i.e. the number of values. Then I would have to relate the number of modal verbs to the number of words that are not modal verbs, which together add up to the size of the corpus. Would this then mean that in one of these 2x2 tables, I enter:

                  C1                      C2
modal verbs       62,000                  20,000
other words       (3.4 mill. – 62,000)    (1.4 mill. – 20,000)

(According to the calculator, this is significant with p < 0.05.)
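
For anyone who wants to check the calculator by hand, here is a minimal Python sketch of the 2x2 table described above. It assumes the plain Pearson chi-square without continuity correction, which may not be what a given online calculator uses; the function names are made up for illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d      # row totals
    col1, col2 = a + c, b + d      # column totals
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Figures from the message: modal verbs and corpus sizes in words.
modals_c1, modals_c2 = 62_000, 20_000
size_c1, size_c2 = 3_400_000, 1_400_000

# Second row: words that are NOT modal verbs, so each column sums to the corpus size.
others_c1 = size_c1 - modals_c1
others_c2 = size_c2 - modals_c2

chi2 = chi_square_2x2(modals_c1, modals_c2, others_c1, others_c2)
p = p_value_df1(chi2)
print(f"chi-square = {chi2:.2f}, p = {p:.3g}")
```

With these figures the statistic comes out at roughly 921, far above the 3.84 needed for p < 0.05 at one degree of freedom, which agrees with the calculator's verdict.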

Thanks once again for your help. It is greatly appreciated.


________________________________________
From: Marc Brysbaert [Marc.Brysbaert at UGent.be]
Sent: Monday, 2 February 2015 21:55
To: Marko, Georg (georg.marko at uni-graz.at)
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance

Hi Georg,

I shouldn't do this, but here is some publicity for an intro textbook I wrote about basic statistics. Every test is explained step by step with simple numerical examples. In my experience, this is the easiest way to learn what is involved:


Nearly all first-year students using the book get a good grasp of stats. I'm sure you wouldn't do worse :-)

All the best, marc

Quoting "Marko, Georg (georg.marko at uni-graz.at)" <georg.marko at uni-graz.at>:

> Dear corpus linguists,
> I am really embarrassed to ask this, but I really am a statistical
> tabula rasa and a mathematical illiterate. This means I always –
> more or less – understand the chapters on descriptive statistics in
> all the introductory textbooks that I have (tried or started) to
> read. But then suddenly there is a jump in the argumentation in
> inferential statistics, easy to comprehend for those with a
> mathematical mind, but not for me. Anyway, I guess I get the basic
> idea of statistical significance conceptually at least, not in all
> its mathematical details. I am only talking of comparisons, mainly
> with the help of chi-square testing. But purely practically
> speaking, I am still not sure how to apply this. If I want to find
> out whether any two values are significantly different (and are thus
> probably related to the difference between the two corpora, and not
> a product of chance), I can calculate this with these two values
> plus the sizes of the corpora? (I.e. entering these figures into a
> statistics programme or a chi-square calculator.)
> I know there are a hundred different options depending on whether
> the samples are very small or whether I think the samples do not
> show normal distribution, etc. But basically, can you do it this way
> with a chi-square test and with these values?
> Second question, as far as I understand chi-square is just for
> nominal values, i.e. for things that you can count. So no relative,
> normalized, percentage figures. So if I want to find out whether the
> average lengths of words, sentences, paragraphs are significantly
> different, what can I do? Can I use the frequencies of these units?
> I mean, the more sentences for the same number of words, the shorter
> the sentences must be on average. If the difference in number is
> significant, does this not have implications for the average length
> as well?
> Third question, if I have subcategories of a phenomenon and I want
> to find out whether the proportions of these subcategories are
> significantly different between two corpora, it makes sense to
> relate the absolute values to the frequencies of the overall
> category rather than to the overall numbers of words in the corpora?
> So if I want to compare, say, predicative and attributive adjective
> phrases, but corpus A contains twice as many APs as corpus B, then
> any calculation using the overall numbers of words will reflect the
> overall results, but not the relations within the category. That is,
> corpus B may contain a significantly higher proportion of
> predicative APs in comparison to attributive ones even if both
> categories are less common in B.
> Sorry for bothering you with such a trivial question. But any response
> would be a great help.
> Thank you
> Georg
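
The third question in the quoted message, about which denominator to use, can be made concrete with a small Python sketch. All counts and corpus sizes below are hypothetical, invented purely for illustration, and the test is assumed to be a plain Pearson chi-square without continuity correction.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# HYPOTHETICAL counts: corpus A has twice as many adjective phrases as corpus B.
pred_a, attr_a = 2_000, 6_000
pred_b, attr_b = 1_500, 2_500
words_a = words_b = 1_000_000  # HYPOTHETICAL corpus sizes

# Denominator = all words: is the predicative AP more frequent per word in A or B?
per_word = chi_square_2x2(pred_a, pred_b, words_a - pred_a, words_b - pred_b)
rate_a, rate_b = pred_a / words_a, pred_b / words_b

# Denominator = all APs: does the share of predicative APs differ between A and B?
within_cat = chi_square_2x2(pred_a, pred_b, attr_a, attr_b)
share_a = pred_a / (pred_a + attr_a)
share_b = pred_b / (pred_b + attr_b)

print(f"per word:        A {rate_a:.4%} vs B {rate_b:.4%}, chi-square = {per_word:.1f}")
print(f"within category: A {share_a:.1%} vs B {share_b:.1%}, chi-square = {within_cat:.1f}")
```

In this made-up case both tables exceed the 3.84 threshold (1 degree of freedom, p < 0.05), but they point in opposite directions: predicative APs are rarer per word in B, yet make up a larger share of B's APs. The two tables answer different questions, so the choice of denominator depends on which question is being asked.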
