# [Corpora-List] An ignorant question concerning the basics of statistical significance

Angus Grieve-Smith grvsmth at panix.com
Mon Feb 2 22:23:11 CET 2015

Correct me if I'm misunderstanding, Yannick, but I think you're confusing statistical significance with effect size. Statistical significance simply measures how likely it is that an effect like yours would show up through sampling error alone; effect size measures the magnitude of the effect itself.
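
To make the distinction concrete, here is a minimal sketch that holds the effect size fixed (20% against 10%, the rates from the AP example quoted below) and only varies the sample size. The corpus sizes other than 100 and 1000 are hypothetical, and the function names are mine. It assumes a likelihood ratio (G²) test with one degree of freedom.

```python
import math

def g2_two_rates(k1, n1, k2, n2):
    """G² (likelihood-ratio statistic) for H0: both samples share one rate."""
    def loglik(k, n, p):
        # Binomial log-likelihood; the binomial coefficient cancels in the ratio.
        return k * math.log(p) + (n - k) * math.log(1 - p)

    pooled = (k1 + k2) / (n1 + n2)
    log_h0 = loglik(k1, n1, pooled) + loglik(k2, n2, pooled)
    log_h1 = loglik(k1, n1, k1 / n1) + loglik(k2, n2, k2 / n2)
    return 2 * (log_h1 - log_h0)

def p_value_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical corpus sizes: the same 20% vs 10% rates at three scales.
samples = [(5, 25, 25, 250), (20, 100, 100, 1000), (200, 1000, 1000, 10000)]

results = []
for k1, n1, k2, n2 in samples:
    effect = k1 / n1 - k2 / n2  # effect size: difference between the two rates
    g2 = g2_two_rates(k1, n1, k2, n2)
    results.append((n1 + n2, effect, g2, p_value_1df(g2)))

for total, effect, g2, p in results:
    print(f"N={total:6d}  difference={effect:.2f}  G2={g2:6.2f}  p={p:.2g}")
```

The difference in rates is identical in all three rows, while the p-value shrinks as the samples grow. That is the sense in which significance and effect size answer different questions.
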

On 2/2/2015 2:35 PM, Yannick Versley wrote:
> Dear Georg,
>
> statistical tests are a tool to separate interesting events from
> background noise when
> you cannot get a noise-free and controlled setting.
>
> As an example, let's say you are a seismologist and you want to know
> about earthquakes in or near Vienna, which may occur once every 70
> years or so. So you
> put up an earthquake
> detector. You go away for a year, looking at earthquakes elsewhere,
> and you find lots of
> events that your seismograph registered. As it turns out, the
> neighbours' kids were stomping
> through the detector room, and it also picked up the rumblings from a
> particularly loud disco
> night.
>
> Obviously, you're still interested in the earthquakes in Vienna and
> not elsewhere, so you want
> to ignore certain events (those that are "not significant") but keep
> the really big ones. But what
> is "big enough"? So, knowing that you didn't capture an earthquake,
> you have an idea of what
> the uninteresting events (children, disco parties) look like.
>
> Your social scientist friend tells you that you should aim to get a
> level of significance of p<0.05.
> That means, instead of having 20 noise events a year (or a ratio of
> one interesting event for
> every 1400 events detected), this statistical significance filter
> would only let through 5% of the
> non-interesting ones, reducing the ratio from 1:1400 to 1:70.
>
> So you say, well, 1:70 seems still a lot, can I do better? And your
> other friend, who works at a
> particle accelerator says, easy, just try to get a significance level
> of p<0.0001 and you will
> have not 1400 non-interesting events but 0.14. So you look at your
> sample, build a statistical
> model of discos and children and other rumbling things, and you notice
> that you would also
> miss 99% of the earthquakes you're interested in. Drat!
>
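
The arithmetic in the last two paragraphs can be sketched as follows. This is only an illustration: the constant and function names are mine, and it assumes the filter lets through exactly a fraction alpha of the noise events.

```python
NOISE_PER_YEAR = 20    # noise events registered per year (from the story)
YEARS_PER_QUAKE = 70   # roughly one Vienna earthquake every 70 years

# Noise events you sit through per real earthquake, with no filtering at all.
noise_per_quake = NOISE_PER_YEAR * YEARS_PER_QUAKE

def false_positives_per_quake(alpha):
    """Expected noise events per earthquake that survive a filter passing a fraction alpha."""
    return noise_per_quake * alpha

for alpha in (1.0, 0.05, 0.0001):
    print(f"alpha={alpha:<7} noise events per earthquake: "
          f"{false_positives_per_quake(alpha):g}")

# The catch from the story: at the strict threshold 99% of earthquakes are missed.
RECALL_AT_STRICT = 0.01
noise_per_detected_quake = false_positives_per_quake(0.0001) / RECALL_AT_STRICT
print(f"noise events per *detected* earthquake: {noise_per_detected_quake:g}")
```

A stricter threshold cuts the false positives, but once the missed earthquakes are counted, the ratio of noise to detected earthquakes need not improve much.
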
> But your physicist friend knows a solution: use two earthquake
> detectors, on opposite corners
> of Vienna, and only count events that occur within minutes of each
> other. That way, you can
> eliminate some of the uninteresting events without missing too many
> earthquakes. You set up
> your two-detector setup, sample the noise events you get for a year,
> and you estimate that
> at the old level, where you previously got 20 noise events, now you
> are only getting two of them,
> which means that the same minimum earthquake size that would have
> gotten you a 0.05 significance threshold will now give you a 0.005
> significance threshold, getting your ratio down from 1:70 to 1:7.
> Filtering a bit more would get you to a significance threshold of
> p<0.0005, which gives you a reasonable ratio of true positives
> (earthquake) to false positives (disco) of 1 : 0.7. That means that
> of 17 events reported
> (within the 700 years or so
> that you have to stick around to see ten earthquakes), ten will be
> earthquakes and seven
> will be other events (big disco events, wars, alien invasions, whatever).
>
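
A short sketch of the final count in this paragraph. It assumes, as the story does at this point, that every earthquake is detected; the variable names are hypothetical.

```python
YEARS = 700
YEARS_PER_QUAKE = 70
FALSE_PER_QUAKE = 0.7             # 0.7 noise events reported per earthquake
BASELINE_NOISE_PER_QUAKE = 1400   # 20 noise events/year * 70 years, one detector, no filter

quakes = YEARS / YEARS_PER_QUAKE          # earthquakes expected in 700 years
false_alarms = quakes * FALSE_PER_QUAKE   # noise events that still get reported
reported = quakes + false_alarms
share_real = quakes / reported            # fraction of reports that are earthquakes

# Fraction of the unfiltered single-detector noise that this ratio corresponds to.
implied_pass_fraction = FALSE_PER_QUAKE / BASELINE_NOISE_PER_QUAKE

print(f"{reported:g} events reported: {quakes:g} earthquakes, {false_alarms:g} others")
print(f"share of real earthquakes: {share_real:.2f}")
print(f"implied fraction of baseline noise let through: {implied_pass_fraction:g}")
```
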
> I hope this story was maths-free and entertaining enough, and that
> I got my main point
> across: going from "A is bigger than B" to "A is significantly bigger
> than B", you apply a filter
> that reduces your false positives (and, quite possibly, also some of
> the true positives).
> In the 19th century, people did not have large computers, so instead
> of making elaborate
> models of the background noise (aka false positives), they assumed
> that measurement
> errors would always be distributed according to the Gaussian, or
> normal, distribution
> ("Glockenkurve"). Using different starting assumptions, you get
> different rules for some
> summary of your data ("sufficient statistic") and a threshold that
> gets the lower 95%
> (or 99.99%, or whichever part you want to ignore) of the events
> according to your
> "noise" model.
>
> For example, likelihood ratio tests (such as Dunning's G²) and chi²,
> which is an approximation
> to a likelihood ratio test, look at the likelihood of seeing some data
> under one assumption
> with fewer parameters (the "noise" model, where you assume nothing is
> interesting
> and all is the same), and how much more likely your data would be if
> you assume
> there are (also) some interesting differences in there.
> See the original article here: http://www.aclweb.org/anthology/J93-1003
>
> In your case, your H0 (everything is boring) would be that the
> probability of seeing an
> AP in corpus A would be the same as seeing an AP in corpus B; the H1
> (something
> interesting is the case) would say that the probabilities are different.
> Now we're talking probabilities, so we need to say where an AP could
> /potentially/
> occur, and we assume that an AP could attach to any noun or any verb,
> and we
> get 100 nouns/verbs for corpus A with 20 APs and 1000 nouns/verbs for
> corpus B
> with 100 APs.
>
> Calculation for H0:
>   P(data) = p ** 20 * (1-p) ** 80 * p ** 100 * (1-p) ** 900
>   with p = 120/1100
> Calculation for H1:
>   P'(data) = p1 ** 20 * (1-p1) ** 80 * p2 ** 100 * (1-p2) ** 900
>   with p1 = 20/100 and p2 = 100/1000
>
> As it turns out, you have a likelihood ratio (i.e.
> p(data|H1)/p(data|H0) ) of around 52,
> and we take 2*ln(52) (natural logarithm), which is about 7.9, and we
> look up the result in a table for a chi2 distribution
> with one degree of freedom and see that it's more than the threshold
> for p<0.005.
>
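
The calculation above can be reproduced with the standard library alone. This is a sketch rather than a reference implementation: it works with log-likelihoods to avoid underflow, uses the natural logarithm for the test statistic, and relies on the fact that for one degree of freedom the chi-squared upper tail equals erfc(sqrt(x/2)). The variable names are mine.

```python
import math

# Counts from the example: APs out of candidate attachment sites (nouns/verbs).
k_a, n_a = 20, 100      # corpus A
k_b, n_b = 100, 1000    # corpus B

def log_likelihood(k, n, p):
    """Log-probability of k hits in n sites at rate p (binomial coefficient omitted)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# H0: one shared rate p = 120/1100 for both corpora.
p = (k_a + k_b) / (n_a + n_b)
log_h0 = log_likelihood(k_a, n_a, p) + log_likelihood(k_b, n_b, p)

# H1: each corpus has its own rate, p1 = 20/100 and p2 = 100/1000.
log_h1 = log_likelihood(k_a, n_a, k_a / n_a) + log_likelihood(k_b, n_b, k_b / n_b)

likelihood_ratio = math.exp(log_h1 - log_h0)
g2 = 2 * (log_h1 - log_h0)               # equals 2 * ln(likelihood ratio)
p_value = math.erfc(math.sqrt(g2 / 2))   # chi-squared upper tail, 1 degree of freedom

CRITICAL_0_005 = 7.879   # tabulated chi-squared critical value, 1 df, p = 0.005

print(f"likelihood ratio = {likelihood_ratio:.1f}")
print(f"G2 = {g2:.2f} (critical value {CRITICAL_0_005})")
print(f"p = {p_value:.4f}")
```

The binomial coefficients are left out because they are the same under H0 and H1 and cancel in the ratio.
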
> The chi² test is a statistic that approximates a likelihood ratio test
> while being cheaper
> computationally (not everyone had a programmable calculator in the
> 19th century)
> but which similarly uses the chi² distribution. (Basically, if you
> make a coordinate cross, and move the pencil up or down according to
> one standard normal distribution and then left or right according to
> another, uncorrelated, standard normal distribution, the squared
> distance to the center is distributed according to the chi²
> distribution with two degrees of freedom.)
>
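
Two small checks on this paragraph, as a sketch under stated assumptions. The first computes Pearson's chi² statistic for the same 2x2 table of AP counts, so it can be compared with the likelihood ratio statistic. The second simulates the pencil picture with two independent standard normal moves and looks at the squared distance from the center. The seed, sample size and names are arbitrary choices of mine.

```python
import random

def pearson_chi2(table):
    """Pearson's chi-squared statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: corpus A, corpus B. Columns: sites with an AP, sites without.
chi2_stat = pearson_chi2([[20, 80], [100, 900]])
print(f"Pearson chi2 = {chi2_stat:.2f}")

# Pencil picture: one standard normal move vertically, one horizontally.
rng = random.Random(0)
squared_distances = [rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2
                     for _ in range(100_000)]

mean_sq = sum(squared_distances) / len(squared_distances)
# 5.991 is the tabulated 5% critical value of chi-squared with 2 degrees of freedom.
tail = sum(d > 5.991 for d in squared_distances) / len(squared_distances)

print(f"mean squared distance = {mean_sq:.3f} (chi2 with 2 df has mean 2)")
print(f"share above 5.991 = {tail:.4f} (expected about 0.05)")
```

On this table the Pearson statistic and the likelihood ratio statistic are close but not identical, which is what you would expect from an approximation.
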
> The other two elements in my story that I should point out are:
> statistical tests act like
> a noise filter in that they tune out events below a certain level, but
> they do not change
> the general signal-to-noise ratio, which means that sometimes
> statistical tests will
> filter out a lot of true positives, and that you will encounter a lot
> of false positives waiting
> for your true positive if the latter is rare. There are also often
> other things besides the difference in magnitude (e.g. multiple
> detectors, keeping out the kids) that you can do to improve
> the signal-to-noise ratio.
>
> Best wishes,
> Yannick
>

--

-Angus B. Grieve-Smith
