I don't think anyone has directly responded to your question here yet. I think what you do below is exactly what many other corpus linguists have been doing with the chi-square test: a 2x2 table containing the number of instances of

- word X in corpus A
- not word X in corpus A
- word X in corpus B
- not word X in corpus B.
Others have already pointed out some of the problems with this approach, including the problem of what you are comparing word X against (all other words in the corpus) and the assumption made by the test that the observations follow a certain frequency distribution.
I'd like to raise the related issue of dispersion. Because tests like the chi-square test only use frequencies at the level of the entire corpus, they ignore the dispersion of the word across the texts in the corpus. This leads to spurious results being marked as significant: for instance, in keyword analysis, a word that only occurs in one or two texts may be reported as occurring at a significantly different frequency between the two corpora, which is not what we would usually want.
However, there are tests that do take dispersion into account. They take as input the relative frequency of the word in each text in the corpus. Tests like this include the Wilcoxon rank-sum test, also known as the Mann-Whitney U test, and tests based on the statistical technique of resampling, such as the bootstrap test. There's a Mann-Whitney U calculator here: http://www.socscistatistics.com/tests/mannwhitney/Default.aspx
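To make the idea concrete, here is a minimal sketch of the Mann-Whitney U test in plain Python. The per-text relative frequencies are invented for illustration, and the normal-approximation p-value ignores tie corrections, so for real work a statistics package (or the calculator above) is preferable:

```python
import math

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test via the normal approximation.

    xs, ys: relative frequencies of the word (e.g. per 1,000 words),
    one value per text, for corpus A and corpus B respectively.
    """
    pooled = list(xs) + list(ys)
    n1, n2 = len(xs), len(ys)
    # Assign 1-based ranks, giving tied values the mean of their ranks.
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    rank = [0.0] * len(pooled)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        for k in range(i, j + 1):
            rank[order[k]] = (i + j) / 2 + 1
        i = j + 1
    u1 = sum(rank[:n1]) - n1 * (n1 + 1) / 2   # U for the first sample
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(u - mean) / sd / math.sqrt(2))  # two-sided
    return u, p

# Invented per-text relative frequencies of word X, one per text:
corpus_a = [2.1, 1.8, 2.5, 2.0, 1.9, 2.3]
corpus_b = [1.1, 0.9, 1.4, 1.0, 1.2, 0.8]
u, p = mann_whitney_u(corpus_a, corpus_b)
print(u, p)  # complete separation of the two samples, so U = 0
```

Because every text in corpus_a has a higher relative frequency than every text in corpus_b, even six texts per corpus yield p ≈ 0.004 here; a word concentrated in just one or two texts would generally not reach significance, which is exactly the behaviour we want.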
In sum, if you do use a significance test, I'd recommend a dispersion-aware test to avoid the danger of many false positives. For more information, see e.g. sections 5.1 and 13.2 of my PhD dissertation at http://urn.fi/URN:ISBN:978-951-9040-50-9 and a recent paper of ours, "Significance testing of word frequencies in corpora", at http://dx.doi.org/10.1093/llc/fqu064 (see my home page at http://www.helsinki.fi/varieng/people/varieng_saily.html for a free-access link).
Best wishes, Tanja
On 2015-02-03 00:32, Marko, Georg (georg.marko at uni-graz.at) wrote:
> Thanks a lot to all of you who responded. That was very enlightening, helpful and also very entertaining (I now constantly think of earthquakes in Vienna). I will also look at and consider the sources suggested.
>
> But can I boil it down to a single question? If I am a discourse analyst working with corpora who loves counting linguistic elements and structures, drawing all kinds of fancy conclusions from these, what will I have to do if a journal editor asks me to apply a statistical significance test to these numbers? (This is not the real situation, my problem is more general or long-term, so I'm not looking for a quick fix.) For a long time, I simply rejected the idea, thinking that the sheer numerical difference should speak for itself, probably because I was afraid it was too complicated (after some experience with statistics, never fully committed though, i.e. especially reading the introductory textbooks on statistics and language studies by Woods, Fletcher and Hughes and by Michael Oakes, but never applying or practicing the methods, I had got the feeling that it was simply not for me). But then I stumbled across this example in McEnery/Xiao/Tono (2006), where they talk about fucker in the BNC spoken and written parts. And they say that it is statistically significant and even give the chi-square score. So I thought that, leaving aside the more advanced aspects and the problems, you can compare two values derived from corpora for statistical significance. Unfortunately, their textbook does not describe how they arrived at the value, what they actually did, or which figures were considered/included. Or perhaps it is plainly obvious to the authors and to all readers but me.
>
> I think I understand the principles of such tests for psychology or demographics, etc. where I have a large number of different values. But I have not been sure how this translates to corpus data because it is not clear to me where I include the corpus sizes.
>
> OK, back to the issue, and I'll take a real example. Modal verbs in two corpora: in C1, 62,000 modal verbs; in C2, 20,000 modal verbs; the sizes of the corpora: C1 = 3.4 million words, C2 = 1.4 million words. Now suppose I decide that the most straightforward way, as the numbers are relatively high, is still to use chi-square. And to spare myself the trouble of calculating it by hand, I use one of the chi-square calculators available online. As the 2x2 table is the smallest available, I will have to take it. Now I have four boxes. Obviously, the first two will take 62,000 and 20,000. In my initial question I suggested that the other two boxes should take the numbers of words in the corpora, because the calculation must take these into account somehow. Yannick's and Zoltan's responses made me think that this was probably a wrong idea, even though they did not say so explicitly. The number of words in a corpus is, to a certain extent, the sample size, i.e. the number of values. Then I would have to relate the number of modal verbs to the number of words that are not modal verbs, which together add up to the size of the corpus. Would this then mean that in one of these 2x2 tables, I enter:
>
> 62,000 20,000
> (3.4 mill. – 62,000) (1.4 mill. – 20,000)
>
> (According to the calculator, this is significant with p < 0.05.)
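That table is indeed the standard layout. As a quick check, the statistic can be computed in a few lines of plain Python (without Yates' continuity correction, so a calculator that applies one will report a slightly smaller value):

```python
# Georg's 2x2 table: modal verbs vs. all other words in C1 and C2.
table = [
    [62_000, 3_400_000 - 62_000],   # C1 (3.4 million words)
    [20_000, 1_400_000 - 20_000],   # C2 (1.4 million words)
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

print(round(chi2, 1))  # around 920, far above 3.84 (5% critical value, 1 df)
```

So the difference is indeed significant at p < 0.05. Note, though, that with corpora this large almost any frequency difference comes out significant, which is one more reason to look at dispersion across texts as well.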
>
> Thanks once again for your help. It is greatly appreciated.
>
> Georg
>
>
> ________________________________________
> From: Marc Brysbaert [Marc.Brysbaert at UGent.be]
> Sent: Monday, 2 February 2015 21:55
> To: Marko, Georg (georg.marko at uni-graz.at)
> Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance
>
> Hi Georg,
>
> I shouldn't do this, but here is some publicity for an intro textbook
> I wrote about basic statistics. Every test is explained step by step
> with simple numerical examples. In my experience, this is the easiest
> way to learn what is involved:
>
> http://www.amazon.co.uk/Basic-Statistics-Psychologists-Marc-Brysbaert/dp/0230275427/ref=sr_1_2?ie=UTF8&qid=1332356088&sr=8-2
>
> Nearly all first-year students using the book get a good grasp of
> stats. I'm sure you wouldn't do worse :-)
>
> All the best, marc
>
>
> Quoting "Marko, Georg (georg.marko at uni-graz.at)" <georg.marko at uni-graz.at>:
>
>> Dear corpus linguists,
>>
>>
>>
>> I am really embarrassed to ask this, but I really am a statistical
>> tabula rasa and a mathematical illiterate. This means I always, more
>> or less, understand the chapters on descriptive statistics in all
>> the introductory textbooks that I have tried or started to read. But
>> then suddenly there is a jump in the argumentation in inferential
>> statistics, easy to comprehend for those with a mathematical mind,
>> but not for me. Anyway, I guess I get the basic
>> idea of statistical significance conceptually at least, not in all
>> its mathematical details. I am only talking of comparisons, mainly
>> with the help of chi-square testing. But purely practically
>> speaking, I am still not sure how to apply this. If I want to find
>> out whether any two values are significantly different (and are thus
>> probably related to the difference between the two corpora, and not
>> a product of chance), can I calculate this with these two values
>> plus the sizes of the corpora? (I.e. by entering these figures into
>> a statistics programme or a chi-square calculator.)
>>
>>
>>
>> I know there are a hundred different options depending on whether
>> the samples are very small or whether I think the samples do not
>> show a normal distribution, etc. But basically, can you do it this way
>> with a chi-square test and with these values?
>>
>>
>>
>> Second question: as far as I understand, chi-square is just for
>> nominal values, i.e. for things that you can count. So no relative,
>> normalized or percentage figures. So if I want to find out whether the
>> average lengths of words, sentences, paragraphs are significantly
>> different, what can I do? Can I use the frequencies of these units?
>> I mean, the more sentences for the same number of words, the shorter
>> the sentences must be on average. If the difference in number is
>> significant, does this not have implications for the average length
>> as well?
>>
>>
>>
>> Third question: if I have subcategories of a phenomenon and I want
>> to find out whether the proportions of these subcategories are
>> significantly different between two corpora, does it make sense to
>> relate the absolute values to the frequencies of the overall
>> category rather than to the overall numbers of words in the corpora?
>> So if I want to compare, say, predicative and attributive adjective
>> phrases, but corpus A contains twice as many APs as corpus B, then
>> any calculation using the overall numbers of words will reflect the
>> overall results, but not the relations within the category. That is,
>> corpus B may contain a significantly higher proportion of
>> predicative APs in comparison to attributive ones even if both
>> categories are less common in B.
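Yes, relating the subcategory counts to the category total is the usual way to ask that question. A sketch with invented counts (all numbers below are made up for illustration) shows how this captures the within-category shift:

```python
# Invented AP counts: corpus A has twice as many APs overall, yet
# corpus B has a higher *proportion* of predicative APs.
#            predicative  attributive
table = [
    [2_000, 6_000],   # corpus A: 8,000 APs, 25% predicative
    [1_600, 2_400],   # corpus B: 4,000 APs, 40% predicative
]

def chi_square(t):
    """Pearson chi-square for a 2x2 table of counts (no Yates correction)."""
    rows = [sum(r) for r in t]
    cols = [sum(c) for c in zip(*t)]
    n = sum(rows)
    return sum(
        (t[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )

chi2 = chi_square(table)
print(chi2)  # large value: the two proportions differ significantly
```

A table built against overall corpus size instead would mix this within-category shift together with the overall drop in AP frequency in corpus B, which is exactly the problem you describe.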
>>
>>
>>
>> Sorry for bothering you with such trivial questions. But any
>> response would be a great help.
>>
>>
>>
>>
>>
>> Thank you
>>
>>
>>
>> Georg
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
--
Dr. Tanja Säily
Research Unit for Variation, Contacts and Change in English (VARIENG)
Department of Modern Languages, University of Helsinki
http://www.helsinki.fi/varieng/people/varieng_saily.html