[Corpora-List] An ignorant question concerning the basics of statistical significance

Yannick Versley versley at cl.uni-heidelberg.de
Mon Feb 2 20:35:16 CET 2015

Dear Georg,

Statistical tests are a tool to separate interesting events from background noise when you cannot get a noise-free and controlled setting.

As an example, let's say you are a seismologist and you want to know about earthquakes in or near Vienna, which may occur once every 70 years or so. So you put up an earthquake detector. You go away for a year, looking at earthquakes elsewhere, and you find lots of events that your seismograph registered. As it turns out, the neighbours' kids were stomping through the detector room, and it also picked up the rumblings from a particularly loud disco night.

Obviously, you're still interested in the earthquakes in Vienna and not elsewhere, so you want to ignore certain events (those that are "not significant") but keep the really big ones. But what is "big enough"? So, knowing that you didn't capture an earthquake, you have an idea of what the uninteresting events (children, disco parties) look like.

Your social scientist friend tells you that you should aim for a significance level of p<0.05. That means that, instead of having 20 noise events a year (over the roughly 70 years between earthquakes, that is one interesting event for every 1400 noise events detected), this statistical significance filter would only let through 5% of the non-interesting ones, reducing the ratio from 1:1400 to 1:70.

So you say, well, 1:70 still seems a lot, can I do better? And your other friend, who works at a particle accelerator, says: easy, just try to get a significance level of p<0.0001 and you will have not 1400 non-interesting events but 0.14. So you look at your sample, build a statistical model of discos and children and other rumbling things, and you notice that at this threshold you would also miss 99% of the earthquakes you're interested in. Drat!
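The arithmetic behind these ratios can be sketched in a few lines of Python. The function name and constants below are only illustrative; the numbers (20 noise events a year, one earthquake every 70 years) are the ones from the story.

```python
# Illustrative arithmetic for the seismograph story; numbers are from the text.
NOISE_EVENTS_PER_YEAR = 20
YEARS_BETWEEN_EARTHQUAKES = 70


def noise_per_earthquake(alpha):
    """Expected noise events passing a filter at level alpha, per real earthquake."""
    total_noise = NOISE_EVENTS_PER_YEAR * YEARS_BETWEEN_EARTHQUAKES  # 1400
    return total_noise * alpha


# alpha = 1.0 means "no filtering at all".
for alpha in (1.0, 0.05, 0.0001):
    print(f"alpha={alpha}: about {noise_per_earthquake(alpha):.2f} noise events per earthquake")
```

Note that this only counts the noise that gets through; it says nothing about how many real earthquakes the stricter threshold would also throw away.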

But your physicist friend knows a solution: use two earthquake detectors, on opposite corners of Vienna, and only count events that occur within minutes of each other. That way, you can eliminate some of the uninteresting events without missing too many earthquakes. You set up your two-detector setup, sample the noise events you get for a year, and you estimate that at the old level, where you previously got 20 noise events, you are now only getting two. This means that the same minimum earthquake size that would have gotten you a 0.05 significance threshold will now give you a 0.005 significance threshold, bringing your ratio down from 1:70 to 1:7. Filtering a bit more would get you to a significance threshold of p<0.0005, which gives you a reasonable ratio of true positives (earthquake) to false positives (disco) of 1 : 0.7. This means that of 17 events reported (within the 700 years or so that you have to stick around to see ten earthquakes), ten will be earthquakes and seven will be other events (big disco events, wars, alien invasions, whatever).
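As a rough check of the final numbers, here is a small Python sketch. It takes the figure of 0.7 surviving noise events per earthquake from the story and assumes, for simplicity, that no real earthquake is filtered out; the function name is made up for illustration.

```python
def reported_events(years, years_between_quakes, noise_per_quake):
    """Expected earthquakes and surviving noise events over a span of years.

    Simplifying assumption: every real earthquake passes the filter.
    """
    quakes = years / years_between_quakes
    noise = quakes * noise_per_quake
    return quakes, noise


quakes, noise = reported_events(years=700, years_between_quakes=70, noise_per_quake=0.7)
share_real = quakes / (quakes + noise)
print(f"{quakes:.0f} earthquakes, {noise:.0f} other events, "
      f"{share_real:.0%} of reported events are real")
```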

While I hope this story was maths-free and entertaining enough, I hope I got my main point across: going from "A is bigger than B" to "A is significantly bigger than B", you apply a filter that reduces your false positives (and, quite possibly, also some of the true positives). In the 19th century, people did not have large computers, so instead of making elaborate models of the background noise (aka false positives), they assumed that measurement errors would always be distributed according to the Gaussian, or normal, distribution ("Glockenkurve"). Using different starting assumptions, you get different rules for some summary of your data ("sufficient statistic") and a threshold that gets the lower 95% (or 99.99%, or whichever part you want to ignore) of the events according to your "noise" model.
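To make the "threshold from a noise model" idea concrete, here is a minimal Python sketch. It assumes the noise follows a standard normal distribution (mean 0, standard deviation 1) and uses a one-sided cut-off; both are illustrative choices, not something your data will necessarily satisfy. `NormalDist` is part of Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import NormalDist

# Assumed noise model: measurement noise is standard normal.
noise_model = NormalDist(mu=0.0, sigma=1.0)


def threshold(ignored_fraction):
    """Value below which the given fraction of pure-noise events falls."""
    return noise_model.inv_cdf(ignored_fraction)


for fraction in (0.95, 0.9999):
    print(f"ignore the lower {fraction:.2%} of noise: keep events above {threshold(fraction):.2f}")
```

With a different noise model you would get a different threshold for the same percentage, which is why the various tests come with different tables.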

For example, likelihood ratio tests (such as Dunning's G²) and chi², which is an approximation to a likelihood ratio test, look at the likelihood of seeing some data under one assumption with fewer parameters (the "noise" model, where you assume nothing is interesting and all is the same), and at how much more likely your data would be if you assume there are (also) some interesting differences in there. See the original article here: http://www.aclweb.org/anthology/J93-1003

In your case, your H0 (everything is boring) would be that the probability of seeing an AP in corpus A would be the same as seeing an AP in corpus B; the H1 (something interesting is the case) would say that the probabilities are different. Now we're talking probabilities, so we need to say where an AP could *potentially* occur, and we assume that an AP could attach to any noun or any verb, and we get 100 nouns/verbs for corpus A with 20 APs and 1000 nouns/verbs for corpus B with 100 APs.

Calculation for H0:
P(data) = p ** 20 * (1-p) ** 80 * p ** 100 * (1-p) ** 900
with p = 120/1100

Calculation for H1:
P'(data) = p1 ** 20 * (1-p1) ** 80 * p2 ** 100 * (1-p2) ** 900
with p1 = 20/100 and p2 = 100/1000

As it turns out, you have a likelihood ratio (i.e. p(data|H1)/p(data|H0)) of around 52. We take 2*log(52) with the natural logarithm, which is about 7.9, look up the result in a table for a chi² distribution with one degree of freedom, and see that it is just above the threshold for p<0.005 (about 7.88).
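The calculation can be reproduced with a short Python sketch. It works with log-likelihoods to avoid multiplying very small numbers, and it takes the test statistic to be twice the natural log of the likelihood ratio (the usual G² definition). The helper name is made up; the p-value line relies on the fact that, for one degree of freedom, the chi² tail probability equals erfc(sqrt(x/2)).

```python
import math


def log_likelihood(hits, total, p):
    """Log-probability of `hits` APs in `total` slots, each with probability p."""
    return hits * math.log(p) + (total - hits) * math.log(1 - p)


a_hits, a_total = 20, 100      # corpus A: APs, nouns/verbs
b_hits, b_total = 100, 1000    # corpus B: APs, nouns/verbs

p = (a_hits + b_hits) / (a_total + b_total)   # H0: one shared probability
p1, p2 = a_hits / a_total, b_hits / b_total   # H1: one probability per corpus

log_h0 = log_likelihood(a_hits, a_total, p) + log_likelihood(b_hits, b_total, p)
log_h1 = log_likelihood(a_hits, a_total, p1) + log_likelihood(b_hits, b_total, p2)

likelihood_ratio = math.exp(log_h1 - log_h0)
g2 = 2 * (log_h1 - log_h0)
p_value = math.erfc(math.sqrt(g2 / 2))  # chi² tail, one degree of freedom

print(f"likelihood ratio ~ {likelihood_ratio:.1f}, G2 ~ {g2:.2f}, p ~ {p_value:.4f}")
```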

The chi² test is a statistic that approximates a likelihood ratio test while being cheaper computationally (not everyone had a programmable calculator in the 19th century) but which similarly uses the chi² distribution. (Basically, if you make a coordinate cross, and move the pencil up or down according to one standard normal distribution and then left or right according to another, uncorrelated, standard normal distribution, the squared distance to the center is distributed according to the chi² distribution with two degrees of freedom.)
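For comparison, here is a sketch of the classic chi² statistic on the same counts (20 of 100 in corpus A, 100 of 1000 in corpus B). It is the plain Pearson formula without a continuity correction, which is an assumption on my part; many calculators apply one by default for 2x2 tables and will report a somewhat smaller value.

```python
def pearson_chi2(table):
    """Pearson chi² statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent (H0).
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: corpus A, corpus B; columns: slots with an AP, slots without an AP.
table = [[20, 80], [100, 900]]
chi2 = pearson_chi2(table)
print(f"chi2 ~ {chi2:.2f}")
```

The value is not identical to the likelihood ratio statistic, but both are compared against the same chi² table with one degree of freedom.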

There are two other elements in my story that I should point out. First, statistical tests act like a noise filter in that they tune out events below a certain level, but they do not change the general signal-to-noise ratio. This means that statistical tests will sometimes filter out a lot of true positives, and that you will encounter a lot of false positives while waiting for your true positive if the latter is rare. Second, there are often other things besides the difference in magnitude (e.g. multiple detectors, keeping out the kids) that you can do to improve your signal-to-noise ratio.

Best wishes, Yannick

On Mon, Feb 2, 2015 at 7:11 PM, Marko, Georg (georg.marko at uni-graz.at) < georg.marko at uni-graz.at> wrote:

> Dear corpus linguists,
>
>
>
> I am really embarrassed to ask this, but I really am a statistical tabula
> rasa and a mathematical illiterate. This means I always – more or less –
> understand the chapters on descriptive statistics in all the introductory
> textbooks that I have (tried or started) to read. But then suddenly there
> is a jump in the argumentation in inferential statistics, easy to
> comprehend for those with a mathematical mind, but not for me. Anyway, I
> guess I get the basic idea of statistical significance conceptually at
> least, not in all its mathematical details. I am only talking of
> comparisons, mainly with the help of chi-square testing. But purely
> practically speaking, I am still not sure how to apply this. If I want to
> find out whether any two values are significantly different (and are thus
> probably related to the difference between the two corpora, and not a
> product of chance), I can calculate this with these two values plus the
> sizes of the corpora? (I.e. entering these figures into a statistics
> programme or a chi-square calculator.)
>
>
>
> I know there are a hundred different options depending on whether the
> samples are very small or whether I think the samples do not show normal
> distribution, etc. But basically, can you do it this way with a chi-square
> test and with these values?
>
>
>
> Second question, as far as I understand chi-square is just for nominal
> values, i.e. for things that you can count. So no relative, normalized,
> percentage figures. So if I want to find out whether the average lengths of
> words, sentences, paragraphs are significantly different, what can I do?
> Can I use the frequencies of these units? I mean, the more sentences for
> the same number of words, the shorter the sentences must be on average. If
> the difference in number is significant, does this not have implications
> for the average length as well?
>
>
>
> Third question, if I have subcategories of a phenomenon and I want to find
> out whether the proportions of these subcategories are significantly
> different between two corpora, it makes sense to relate the absolute values
> to the frequencies of the overall category rather than to the overall
> numbers of words in the corpora? So if I want to compare, say, predicative
> and attributive adjective phrases, but corpus A contains twice as many APs
> as corpus B, then any calculation using the overall numbers of words will
> reflect the overall results, but not the relations within the category.
> This is, corpus B may contain a significantly higher proportion of
> predicative APs in comparison to attributive ones even if both categories
> are less common in B.
>
>
>
> Sorry for bothering you with such trivial questions. But any response would
> be a great help.
>
>
>
>
>
> Thank you
>
>
>
> Georg