[Corpora-List] An ignorant question concerning the basics of statistical significance

Jim Fidelholtz fidelholtz at gmail.com
Mon Feb 2 20:52:50 CET 2015

Hi, Georg,

Look, the short answer is that statistics can mean whatever you want it to mean! Normally, 'statistically significant' is taken to mean that p < .05, where p is the probability of getting a result at least as extreme as yours purely by chance, that is, if there were really no difference at all. The cutoff of .05 is one in 20. In other words, if you accept the .05 level, then about one in 20 of the comparisons where nothing is really going on will still come out 'statistically significant', just by chance. OK, but in principle, 1 in 20 is an arbitrary figure. As long as you are explicit about it, you could take most any fraction you want as the dividing line for 'statistically significant'. Now, this is a bit of a quibble, I'll admit, and you rarely see anybody using anything but the standard .05 as the dividing line. One thing mathematicians (and 'I are one') are careful about is that it makes no sense to call things 'very significant' or 'nearly significant'. If p > .05, never mind by how much, then the result is 'not significant'. If p < .05, again, never mind by how much, then the result is significant, period. Statisticians (and/or liars) may inform you (often proudly) that 'p < .00000001' or the like, but they will never (if they are any good) tell you the results are 'very significant', even though, in the cited example, they might be very certain of the significance of their results.
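
One way to see what the 1-in-20 threshold means in practice is a small simulation. The sketch below is only an illustration under stated assumptions: the data are artificial (two samples drawn from the same normal distribution, so there is no real difference by construction), the test is a simple z-test with a known standard deviation rather than chi-squared, and the function names are made up for this example. It counts how often the comparison nevertheless comes out below .05.

```python
import math
import random


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming sigma is known."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    # For a standard normal z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def false_alarm_rate(trials=2000, n=30, alpha=0.05, seed=0):
    """Share of comparisons called 'significant' when nothing differs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both samples come from the same distribution (hypothetical data).
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if two_sided_p(a, b) < alpha:
            hits += 1
    return hits / trials


rate = false_alarm_rate()
print(f"share of 'significant' results with no real difference: {rate:.3f}")
```

The printed share should land somewhere near .05, and changing alpha moves it accordingly, which is all the dividing line amounts to.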

The so-called 'chi squared' test you mentioned deserves some comment. It is very commonly used, principally, I believe, because the calculations to derive the result are pretty straightforward and even fairly easy to do. However, one has to be very careful in using it. You mention the problem if the numbers (in the cells) are small. In that case, the results may easily get skewed; so, for example, when your tables have some columns or rows consisting principally of cells with very low numbers (0 to 4, say), many people would prefer to combine those columns or rows with an adjacent one to avoid this particular problem. This problem, I should think, is common to most statistical tests. However, it is less common to see emphasis placed on the opposite case: for chi-squared, a verdict of 'statistically significant' tells you very little when some or most cells have large numbers (in the hundreds or more, say), because in this case you are almost guaranteed that verdict. This is due to the nature of the test, which involves summing, over all cells, the result of a particular calculation which uses the number in each cell. Basically, each sub-calculation takes the difference between the observed number and the number you would expect if rows and columns were unrelated, squares it, and divides by the expected number. If you multiply every cell by the same factor, leaving the proportions exactly as they were, the sum gets multiplied by that factor too. Clearly, then, if the cell contents are large enough, even a tiny difference in proportions will make the result quite large. Tables will tell you whether the result is large enough (this depends on the number of rows and columns in the matrix, basically), but it should be clear that, for the chi-squared test, large numbers in the cells will push the result to 'significant' in almost all cases, however small the actual difference is. Chi-squared can be a useful test with smallish numbers in each cell, but needs care in its use to avoid cell numbers that are too small or (worse) too large.
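
To make the effect of large cell numbers concrete, here is a minimal sketch of the usual Pearson chi-squared calculation, in which each cell contributes (observed - expected)^2 / expected. The two tables are invented for illustration: they have exactly the same proportions, and the second is simply the first multiplied by 10. The helper names are my own, and the p-value shortcut via erfc is only valid for one degree of freedom, i.e. a 2 x 2 table.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for a table of raw counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_2x2(stat):
    """p-value for 1 degree of freedom (2 x 2 tables only)."""
    return math.erfc(math.sqrt(stat / 2))


small = [[12, 8], [8, 12]]        # hypothetical counts
large = [[120, 80], [80, 120]]    # same proportions, ten times the data

for name, table in (("small", small), ("large", large)):
    stat = chi_squared(table)
    print(name, round(stat, 3), p_value_2x2(stat))
```

The statistic for the larger table is ten times that for the smaller one (16 against 1.6), so the same 60/40 split is 'not significant' in the first case and 'significant' in the second.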

Of course, there are other tests which avoid the problem of large numbers in the cells (that is, their utility is not affected by large numbers), but they are often correspondingly difficult (even impractical) to use without a computer. Chi-squared can be done 'manually', which makes it attractive for non-mathematically inclined people.

One other thing, if you want to use chi-squared: the test depends on putting your data into matrices (layouts) which basically compare one category (the columns) with another category (the rows). The minimum useful matrix would basically be 2 x 2. You can set up your categories pretty much any way you want to, but just remember the nerd's basic motto: GIGO, or 'garbage in, garbage out'. I.e., if your categories are nonsensical, so will your results be. This translates into (according to me): it is tempting to reduce larger numbers to percentages (and notice that this has the effect of reducing all cell entries to numbers that are less than or equal to 100), but the test is meant for raw counts, and feeding it percentages amounts to pretending that each sample contains only 100 items. Whatever you do, just make sure you are clear on what the numbers you are manipulating represent, or what they mean.
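
As a rough sketch of the layout for the case Georg describes (two frequencies plus the two corpus sizes), the rows can be the corpora and the columns 'the item' versus 'everything else'. All figures below are invented, and the function names are hypothetical. The sketch also runs the same calculation on percentages instead of counts, to show how differently the statistic then behaves.

```python
def chi_squared(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def two_by_two(freq_a, size_a, freq_b, size_b):
    """Rows: corpus A, corpus B. Columns: the item, all other tokens."""
    return [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]


# Hypothetical data: 150 hits in 100,000 tokens vs 100 hits in 100,000 tokens.
counts = two_by_two(150, 100_000, 100, 100_000)
# The same data expressed as percentages (0.15% vs 0.10%).
percents = two_by_two(0.15, 100, 0.10, 100)

# Standard table value for p = .05 with 1 degree of freedom.
CRITICAL_05_DF1 = 3.841

for name, table in (("raw counts", counts), ("percentages", percents)):
    stat = chi_squared(table)
    print(name, round(stat, 4), "significant" if stat > CRITICAL_05_DF1 else "not significant")
```

With these made-up figures the raw counts give a statistic of about 10 and the percentages give about 0.01, so the choice of what goes into the cells decides the verdict.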

I should point out that, although I am a mathematician by training, I am not an expert on statistics as such, although I have enough experience to know I should be careful when I use statistical tests. Hope this helps.

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Mon, Feb 2, 2015 at 12:11 PM, Marko, Georg <georg.marko at uni-graz.at> wrote:

> Dear corpus linguists,
>
> I am really embarrassed to ask this, but I really am a statistical tabula
> rasa and a mathematical illiterate. This means I always – more or less –
> understand the chapters on descriptive statistics in all the introductory
> textbooks that I have (tried or started) to read. But then suddenly there
> is a jump in the argumentation in inferential statistics, easy to
> comprehend for those with a mathematical mind, but not for me. Anyway, I
> guess I get the basic idea of statistical significance conceptually at
> least, not in all its mathematical details. I am only talking of
> comparisons, mainly with the help of chi-square testing. But purely
> practically speaking, I am still not sure how to apply this. If I want to
> find out whether any two values are significantly different (and are thus
> probably related to the difference between the two corpora, and not a
> product of chance), I can calculate this with these two values plus the
> sizes of the corpora? (I.e. entering these figures into a statistics
> programme or a chi-square calculator.)
>
> I know there are a hundred different options depending on whether the
> samples are very small or whether I think the samples do not show normal
> distribution, etc. But basically, can you do it this way with a chi-square
> test and with these values?
>
> Second question, as far as I understand chi-square is just for nominal
> values, i.e. for things that you can count. So no relative, normalized,
> percentage figures. So if I want to find out whether the average lengths of
> words, sentences, paragraphs are significantly different, what can I do?
> Can I use the frequencies of these units? I mean, the more sentences for
> the same number of words, the shorter the sentences must be on average. If
> the difference in number is significant, does this not have implications
> for the average length as well?
>
> Third question, if I have subcategories of a phenomenon and I want to find
> out whether the proportions of these subcategories are significantly
> different between two corpora, it makes sense to relate the absolute values
> to the frequencies of the overall category rather than to the overall
> numbers of words in the corpora? So if I want to compare, say, predicative
> and attributive adjective phrases, but corpus A contains twice as many APs
> as corpus B, then any calculation using the overall numbers of words will
> reflect the overall results, but not the relations within the category.
> That is, corpus B may contain a significantly higher proportion of
> predicative APs in comparison to attributive ones even if both categories
> are less common in B.
>
> Sorry for bothering you with such trivial questions. But any response would
> be a great help.
>
> Thank you
>
> Georg