I believe Lou pretty much has it in one. I have had some interesting discussions about representativeness with social scientists. Our approach to representativeness is by necessity more fluid and impressionistic than that used in some social sciences, though it is also similar to that used by other social scientists. To reach perfect representativeness we need, as Ramesh suggests, a good model of what we are representing. However, we have that for language no more than panel surveys (e.g. the UK household survey) have that for the whole of the UK. So we select factors and we work towards making sure that you can study those with that data. So the type of statement Lou made is important: it is useful to know what corpus builders intended you to be able to study using their data, in just the same way as it is important to know what a panel survey intended you to be able to study using it. So I would still appeal to representativeness as a notion and as an ideal. But I accept that, when operationalised, corpora (with rare exceptions) approximate to, rather than achieve, this ideal. But this is far from unusual in the social sciences, so we need not wring our hands unduly. Anyway, those are my thoughts on the matter, Angus, as you asked for them.
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Lou Burnard [lou.burnard at retired.ox.ac.uk]
Sent: 03 February 2015 16:10
To: corpora at uib.no
Subject: Re: [Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS
However, just for the record, I don't think the BNC ever claimed to be representative of language usage as a whole. Its design principles, as I understood them at least, were chiefly to give "equal time" to as many as possible of the discernible varieties of late 20th c. English, impressionistically defined, and within the bounds of what was economically feasible at the time. Which is clearly not the same thing at all.
On 03/02/15 15:52, Krishnamurthy, Ramesh wrote:
> Hi Angus
> As we have no adequate way of estimating language usage,
> and corpora are samples of language usage,
> is there any point in discussing 'representativeness' again?
> Or has there been an advance in estimating language usage
> in the past 30 years that I am unaware of?
> Date: Mon, 02 Feb 2015 22:40:04 -0500
> From: Angus Grieve-Smith <grvsmth at panix.com>
> Subject: Re: [Corpora-List] An ignorant question concerning the basics
> of statistical significance
> To: corpora at uib.no
> I know that David Lee had problems with the representativeness of
> the BNC, but I believe that Tony McEnery, at least, is on the list, so
> he can maybe tell us more about why the BNC is representative, and of what.
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no