[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Tue Feb 3 22:04:45 CET 2015


Hi all

My answer to Angus's question about 'what an adequate way of estimating language usage might be' is that I don't think we do have a way at the moment.

I would therefore suggest, as a consequence, that editors/reviewers should certainly be wary of pushing researchers into providing /p/-values or any other statistic, as it encourages people to regard statistical measures as of higher value than discursive reports and findings.

I know several junior researchers who were pressured into providing more statistical representations of their results than they themselves were happy with. I find a similar elevation of 'precision/recall' stats in Computational Linguistics. Both groups need to remember that some researchers in them (like me) are primarily interested in language studies, and an over-emphasis on stats tends to lead us away from this.

I agree with Lou and Tony that the BNC designers and creators tried their best to collect a varied and reasonable corpus of texts and genres that were available at the time.

I myself happily and frequently collect corpora on topics that interest me, or create corpora from text collections that I happen to notice have become available (e.g. the court documents relating to the Michael Brown shooting). All we can do is to give a rationale for the contents of any corpus, and a detailed description of them, and mould our linguistic descriptions accordingly, and posit generalisations that are a reasonable extrapolation from the data?

best ramesh

-----
Date: Tue, 03 Feb 2015 11:09:19 -0500
From: Angus Grieve-Smith <grvsmth at panix.com>
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS
To: "corpora at uib.no" <corpora at uib.no>

That begs the question of what an adequate way of estimating language usage might be. Is there a way to do it at all?

If not, we should be pushing back harder on editors and reviewers who want to see /p/-values.

-----
Date: Tue, 03 Feb 2015 16:10:49 +0000
From: Lou Burnard <lou.burnard at retired.ox.ac.uk>
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS
To: corpora at uib.no

Hi Ramesh

Hear hear!

However, just for the record, I don't think the BNC ever claimed to be representative of language usage as a whole. Its design principles, as I understood them at least, were chiefly to give "equal time" to as much as possible of the discernible varieties of late 20th c. English, impressionistically defined, and within the bounds of what was economically feasible at the time. Which is clearly not the same thing at all.

Lou

-----
Date: Tue, 3 Feb 2015 16:47:36 +0000
From: "Mcenery, Tony" <a.mcenery at lancaster.ac.uk>
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS
To: "corpora at uib.no" <corpora at uib.no>

Hi Angus,

I believe Lou pretty much has it in one. I have had some interesting discussions about representativeness with social scientists. Our approach to representativeness is by necessity more fluid and impressionistic than that used in some social sciences, though it is also similar to that used by other social scientists. To reach perfect representativeness we need, as Ramesh suggests, a good model of what we are representing. However, we have that for language no more than panel surveys (e.g. the UK household survey) have that for the whole of the UK. So we select factors and we work towards making sure that you can study those with that data. So the type of statement Lou made is important - it is useful to know what corpus builders intended you to be able to study using their data in just the same way as it is important to know what a panel survey intended you to be able to study using it. So I would still appeal to representativeness as a notion and as an ideal. But I accept that, when operationalised, corpora (with rare exceptions) approximate to, rather than achieve, this ideal. But this is far from unusual in the social sciences, so we need not hand-wring unduly. Anyway, those are my thoughts on the matter Angus, as you asked for them.

Best,

Tony

------
Date: Tue, 3 Feb 2015 16:57:17 +0000
From: "Mcenery, Tony" <a.mcenery at lancaster.ac.uk>
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS
To: "corpora at uib.no" <corpora at uib.no>

Angus, as a P.S., a book I rather like, with a nice chapter on corpus construction in it by Bauer and Aarts is:

Bauer M & G Gaskell (2000) Qualitative researching with text, image and sound: a practical handbook. London, Sage

Worth a look. Best,

Tony ------


