[Corpora-List] An ignorant question concerning the basics of statistical significance

Angus Grieve-Smith grvsmth at panix.com
Mon Feb 2 22:20:20 CET 2015

On 2/2/2015 1:11 PM, Marko, Georg (georg.marko at uni-graz.at) wrote:
> I know there are a hundred different options depending on whether the
> samples are very small or whether I think the samples do not show
> normal distribution, etc. But basically, can you do it this way with a
> chi-square test and with these values?

An important thing to know about statistical significance is that it depends on having a representative sample. If your corpus is not representative of whatever you want to generalize it to (the whole language, usually), you are simply not justified in generalizing, no matter what the significance tests say. I blogged about this:


That said, many conferences, journals and tenure committees just ignore the whole representativeness thing. Usually when I bring it up here on the corpora list, there's an embarrassed silence, and then a few people just go on talking about measuring the "significance" of non-representative samples. Feel free to do that as usual, everybody.

> Second question, as far as I understand chi-square is just for nominal
> values, i.e. for things that you can count. So no relative,
> normalized, percentage figures. So if I want to find out whether the
> average lengths of words, sentences, paragraphs are significantly
> different, what can I do? Can I use the frequencies of these units? I
> mean, the more sentences for the same number of words, the shorter the
> sentences must be on average. If the difference in number is
> significant, does this not have implications for the average length as
> well?

You're right about that (and I had to look at Wikipedia for this): chi-square is for frequencies in the sense of raw counts in each category, not relative or normalized values between 0 and 1. For averages where you expect a normal distribution, you can use Student's /t/-test.
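To make the /t/-test suggestion concrete, here is a minimal sketch of the two-sample Student's /t/ statistic with pooled variance, using only the Python standard library. The sentence lengths are made up for illustration, and the function name student_t is my own, not from any package. It returns only the statistic and the degrees of freedom; for a p-value you would look the result up in a /t/ table or hand the same data to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Return (t, df) for a two-sample Student's t-test with pooled variance.

    Assumes both samples are roughly normal with similar variances.
    """
    n1, n2 = len(a), len(b)
    # Pooled estimate of the common variance (sample variances, n - 1 denominator).
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical sentence lengths (in words) sampled from two corpora.
lengths_a = [12, 15, 9, 20, 14, 11, 17, 13]
lengths_b = [22, 18, 25, 19, 27, 21, 16, 24]

t, df = student_t(lengths_a, lengths_b)
print(f"t = {t:.3f}, df = {df}")
```

A negative /t/ here just means the first sample has the smaller mean. Note that the input is the individual lengths, not the number of sentences per corpus.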

> Third question, if I have subcategories of a phenomenon and I want to
> find out whether the proportions of these subcategories are
> significantly different between two corpora, it makes sense to relate
> the absolute values to the frequencies of the overall category rather
> than to the overall numbers of words in the corpora? So if I want to
> compare, say, predicative and attributive adjective phrases, but
> corpus A contains twice as many APs as corpus B, then any calculation
> using the overall numbers of words will reflect the overall results,
> but not the relations within the category. This is, corpus B may
> contain a significantly higher proportion of predicative APs in
> comparison to attributive ones even if both categories are less common
> in B.

What you really want is the /envelope of variation/: how often does the phenomenon occur relative to the number of times it has a chance to occur? If predicative and attributive adjective phrases are the only possibilities, you can add up the frequencies and use that as your denominator. It all depends on your hypothesis. I wrote an article about this; if you don't have free access I can send you a copy:
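Here is a rough sketch of what that looks like in practice, assuming predicative and attributive are the only two possibilities. The counts are invented (corpus A has twice as many APs as corpus B, as in your example), and the function name chi_square_2x2 is my own. The table holds raw counts of APs only, so the total number of APs in each corpus is the denominator and corpus size in words never enters the calculation. This is the plain Pearson chi-square with no continuity correction.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 count table

        [[a, b],
         [c, d]]

    Returns (chi2, p) with one degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts of adjective phrases (rows: corpus, columns: type).
pred_a, attr_a = 300, 700   # corpus A: 1000 APs, 30% predicative
pred_b, attr_b = 200, 300   # corpus B:  500 APs, 40% predicative

chi2, p = chi_square_2x2(pred_a, attr_a, pred_b, attr_b)
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")
```

With these made-up numbers corpus B has fewer APs of both kinds but a higher share of predicative ones, which is exactly the situation you describe. Whether the result means anything beyond the two corpora still depends on the representativeness point above.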


Good luck, Georg!


-Angus B. Grieve-Smith

grvsmth at panix.com
