[Corpora-List] An ignorant question concerning the basics of statistical significance

Martin Weisser weissermar at gmail.com
Wed Feb 4 04:39:51 CET 2015

Dear all,

While this has certainly been an enlightening and, to some extent, entertaining discussion of the above issue, I feel that a) the questions, along with some of the real issues here, haven't really been answered, and b) the discussion has somehow 'drifted off' in the wrong direction, towards representativeness.

Coming back to the original issue, a point that has (somewhat too vaguely or indirectly) come up in some of the posts is that most statistics are designed to deal with observations that occur in the 'natural world', where there's little or no element of human control. There, it may well be justifiable to compare one's results to a Normal (Gaussian) Distribution, while in dealing with language, as Zipf showed ages ago, we cannot really assume that anything is normally distributed, and we (thus) have very different probabilities from the ones that occur in natural events. Language is, after all, influenced by conscious choices that shape its structures and socially accepted rules.

Furthermore, as I think Georg wanted to say anyway, it may be questionable to simply relate any measure of significance pertaining to specific linguistic phenomena, such as the use of modals, to overall word frequencies, as this doesn't make sense from a communicative point of view. In other words, modals tend to be used with a specific purpose by speakers, and thus cannot simply be compared to other verbs, let alone to all other words in a corpus or in different corpora.

In addition, the issue of normalisation, which is certainly essential here, hasn't been raised either. However, to come back to the point about communicative functions, it really doesn't make sense to compare the use of modals in terms of raw occurrence frequencies, or even relative ones.
An approach that would already be more sensible here is to at least compare them relative to the number of (relative) c-units (or perhaps even clauses) in a corpus, even though this would still potentially ignore repetitions due to false starts or other disfluencies in spoken corpora.

To make a long story somewhat shorter ;-), what we'd really need is to devise more language-appropriate statistics, as well as to apply them to the right units of speech. These issues are, unfortunately, all too frequently ignored in the pursuit of 'proving' anything in language via (largely inappropriate) statistics, simply because the assumption is that this is (even) possible. And, perhaps, once we've all thought about this a little more, agreed on how to do it, and succeeded in devising better statistics, we could then return to the issue of representativeness?

--
Cheers,

Martin
========================
Dr. phil. habil. Martin Weisser
Professor
Research Center for Linguistics & Applied Linguistics
Guangdong University of Foreign Studies
510420 Guangzhou
P.R. China
Web: martinweisser.org
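
P.S. As a rough illustration of the normalisation point above, here is a minimal Python sketch that puts the same modal count on three different footings: raw, per 1,000 words, and per 100 c-units. The class and function names (CorpusCounts, per_thousand_words, per_hundred_c_units) and all the figures are invented for the example and do not come from any real corpus; the sketch also assumes that c-units have already been segmented and counted somehow.

```python
from dataclasses import dataclass


@dataclass
class CorpusCounts:
    """Hypothetical summary counts for one corpus (all figures invented)."""
    name: str
    modal_tokens: int  # occurrences of the modals under study
    word_tokens: int   # total running words
    c_units: int       # total c-units (or clauses), assumed pre-segmented


def per_thousand_words(c: CorpusCounts) -> float:
    """Modal frequency normalised by overall corpus size in words."""
    return 1000 * c.modal_tokens / c.word_tokens


def per_hundred_c_units(c: CorpusCounts) -> float:
    """Modal frequency normalised by the number of c-units."""
    return 100 * c.modal_tokens / c.c_units


# Two toy corpora: same size in words, same number of modals,
# but corpus B is made up of twice as many (shorter) c-units.
corpus_a = CorpusCounts("A", modal_tokens=300, word_tokens=100_000, c_units=10_000)
corpus_b = CorpusCounts("B", modal_tokens=300, word_tokens=100_000, c_units=20_000)

for c in (corpus_a, corpus_b):
    print(f"{c.name}: raw={c.modal_tokens}  "
          f"per 1,000 words={per_thousand_words(c):.1f}  "
          f"per 100 c-units={per_hundred_c_units(c):.1f}")
```

In this toy case the raw and per-word figures are identical for both corpora, while the per-c-unit rate in B is half that in A, so the choice of unit alone changes what there is to compare. Disfluencies and false starts are not handled at all here.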
