[Corpora-List] An ignorant question concerning the basics of statistical significance

Marko, Georg (georg.marko@uni-graz.at) georg.marko at uni-graz.at
Mon Feb 2 19:11:56 CET 2015

Dear corpus linguists,

I am really embarrassed to ask this, but I am a statistical tabula rasa and a mathematical illiterate. This means I always (more or less) understand the chapters on descriptive statistics in all the introductory textbooks that I have tried (or started) to read. But then there is suddenly a jump in the argumentation in inferential statistics, easy to comprehend for those with a mathematical mind, but not for me. Anyway, I think I get the basic idea of statistical significance, conceptually at least, if not in all its mathematical details. I am only talking about comparisons, mainly with the help of chi-square testing.

Purely practically speaking, though, I am still not sure how to apply this. If I want to find out whether any two values are significantly different (and are thus probably related to the difference between the two corpora rather than a product of chance), can I calculate this with these two values plus the sizes of the corpora (i.e. by entering these figures into a statistics programme or a chi-square calculator)?

I know there are a hundred different options depending on whether the samples are very small or whether I think the samples are not normally distributed, etc. But basically, can you do it this way, with a chi-square test and with these values?
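To make the calculation concrete, here is a minimal sketch of what a chi-square calculator might do with exactly these four figures (two frequencies plus two corpus sizes). The function name and the numbers are made up for illustration, and the sketch assumes a plain Pearson chi-square on a 2x2 table without continuity correction; a real statistics programme may apply corrections or a different test.

```python
import math


def chi_square_two_corpora(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are the two corpora; columns are "the item" vs. "all other words".
    Returns the chi-square statistic and the p-value for 1 degree of freedom.
    """
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the item were equally frequent in both corpora.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square tail probability has a closed
    # form in terms of the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical figures: 120 hits in 100,000 words vs. 80 hits in 100,000 words.
chi2, p = chi_square_two_corpora(120, 100_000, 80, 100_000)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

With these invented figures the statistic comes out at roughly 8, which is above the conventional 3.84 threshold for p < 0.05 at one degree of freedom.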

Second question: as far as I understand, chi-square is just for nominal values, i.e. for things that you can count, so not for relative, normalized or percentage figures. So if I want to find out whether the average lengths of words, sentences or paragraphs are significantly different, what can I do? Can I use the frequencies of these units? I mean, the more sentences there are for the same number of words, the shorter the sentences must be on average. If the difference in number is significant, does this not have implications for the average length as well?
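For comparison, here is a sketch of how average lengths could be compared directly from the individual measurements rather than from counts. It assumes the length of every sentence is available as a list of numbers, and it uses a large-sample normal approximation to a two-sample test with unequal variances; that choice, the function name, and the data are illustrative assumptions only. With small samples a t-distribution or a rank-based test would be the more usual choice.

```python
import math
import statistics


def compare_mean_lengths(lengths_a, lengths_b):
    """Two-sided large-sample z-test for a difference in mean length.

    Each argument is a list with one length (e.g. words per sentence)
    per unit. Returns the z statistic and an approximate p-value.
    """
    mean_a = statistics.mean(lengths_a)
    mean_b = statistics.mean(lengths_b)
    var_a = statistics.variance(lengths_a)  # sample variance
    var_b = statistics.variance(lengths_b)

    # Standard error of the difference, without assuming equal variances.
    standard_error = math.sqrt(var_a / len(lengths_a) + var_b / len(lengths_b))
    z = (mean_a - mean_b) / standard_error

    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical sentence lengths (in words) for 200 sentences per corpus.
lengths_a = [10 + (i % 7) for i in range(200)]
lengths_b = [12 + (i % 9) for i in range(200)]
z, p = compare_mean_lengths(lengths_a, lengths_b)
print(f"z = {z:.2f}, p = {p:.3g}")
```

Unlike a chi-square test on the number of sentences, this uses the spread of the individual lengths, which the raw counts do not capture.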

Third question: if I have subcategories of a phenomenon and I want to find out whether the proportions of these subcategories are significantly different between two corpora, does it make sense to relate the absolute values to the frequencies of the overall category rather than to the overall numbers of words in the corpora? So if I want to compare, say, predicative and attributive adjective phrases (APs), but corpus A contains twice as many APs as corpus B, then any calculation using the overall numbers of words will reflect the overall results, but not the relations within the category. That is, corpus B may contain a significantly higher proportion of predicative APs relative to attributive ones even if both categories are less common in B.
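The within-category comparison can be sketched as a 2x2 table whose rows are the corpora and whose columns are the two subcategories, so that the totals are AP counts and the number of words in each corpus does not enter at all. The counts below are invented to match the scenario (corpus A has twice as many APs as corpus B), the function name is hypothetical, and the test is a plain Pearson chi-square without continuity correction.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 count table.

    Returns the chi-square statistic and the p-value for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Closed-form tail probability for 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical AP counts: [predicative, attributive] per corpus.
ap_counts = [
    [600, 1400],  # corpus A: 2,000 APs in total
    [400, 600],   # corpus B: 1,000 APs in total
]

for name, (predicative, attributive) in zip("AB", ap_counts):
    share = predicative / (predicative + attributive)
    print(f"corpus {name}: {share:.0%} of APs are predicative")

chi2, p = chi_square_2x2(ap_counts)
print(f"chi-square = {chi2:.2f}, p = {p:.3g}")
```

In this made-up case both subcategories are rarer in corpus B in absolute terms, yet the predicative share is higher there (40% against 30%), and the test addresses only that difference in proportions.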

Sorry for bothering you with such trivial questions. But any response would be a great help.

Thank you

