[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Jim Fidelholtz fidelholtz at gmail.com
Tue Feb 3 18:25:12 CET 2015


Hi, all,

I've just recently read Kilgarriff & Grefenstette (?1998 or ?2002--don't remember which, but one of them was Kilgarriff et al., not including G.). Anyway, what I came away with was that more or less recent research supports the following: (1) the notion of 'the Web as corpus' is feasible, real and useful; (2) tools for automatically deciding what language (or even languages) a site is 'in' have progressed to a useful and usable state (NB: this seems to include 'dialects'); (3) ditto for tools that distinguish style and genre (including fields and subfields); and, most importantly (especially because, it seems to me, this is not obviously a foregone conclusion), (4) the bigger the corpus, the more stable the statistics based on it become. All of the above bodes well for a positive outcome to this 'representativeness' discussion, e.g. using the Web as corpus, and likewise for those 'dreamers' who continue to build humongous corpora (go! go!, Mark Davies et al.!!).

Nevertheless, in the future I will have some possibly less positive results to communicate about the lower end of the lexical frequency scale, though very preliminary results still keep this optimistic soul buoyed up.

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Tue, Feb 3, 2015 at 10:47 AM, Mcenery, Tony <a.mcenery at lancaster.ac.uk> wrote:


> Hi Angus,
>
> I believe Lou pretty much has it in one. I have had some interesting
> discussions about representativeness with social scientists. Our approach
> to representativeness is by necessity more fluid and impressionistic than
> that used in some social sciences, though it is also similar to that used
> by other social scientists. To reach perfect representativeness we need, as
> Ramesh suggests, a good model of what we are representing. However, we have
> that for language no more than panel surveys (e.g. the UK household survey)
> have that for the whole of the UK. So we select factors and we work towards
> making sure that you can study those with that data. So the type of
> statement Lou made is important - it is useful to know what corpus builders
> intended you to be able to study using their data in just the same way as
> it is important to know what a panel survey intended you to be able to
> study using it. So I would still appeal to representativeness as a notion
> and as an ideal. But I accept that, when operationalised, corpora (with
> rare exceptions) approximate to, rather than achieve, this ideal. But this
> is far from unusual in the social sciences, so we need not hand wring
> unduly. Anyway, those are my thoughts on the matter Angus, as you asked for
> them.
>
> Best,
>
> Tony
>
> ________________________________________
> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Lou
> Burnard [lou.burnard at retired.ox.ac.uk]
> Sent: 03 February 2015 16:10
> To: corpora at uib.no
> Subject: Re: [Corpora-List] An ignorant question concerning the basics of
> statistical significance > REPRESENTATIVENESS
>
> Hi Ramesh
>
> Hear hear!
>
> However, just for the record, I don't think the BNC ever claimed to be
> representative of language usage as a whole. Its design principles, as I
> understood them at least, were chiefly to give "equal time" to as many
> as possible of the discernible varieties of late 20th c. English,
> impressionistically defined, and within the bounds of what was
> economically feasible at the time. Which is clearly not the same thing
> at all.
>
> Lou
>
>
> On 03/02/15 15:52, Krishnamurthy, Ramesh wrote:
> > Hi Angus
> >
> > As we have no adequate way of estimating language usage,
> > and corpora are samples of language usage,
> > is there any point in discussing 'representativeness' again?
> >
> > Or has there been an advance in estimating language usage
> > in the past 30 years that I am unaware of?
> >
> > best
> > ramesh
> > --------
> > Date: Mon, 02 Feb 2015 22:40:04 -0500
> > From: Angus Grieve-Smith <grvsmth at panix.com>
> > Subject: Re: [Corpora-List] An ignorant question concerning the basics
> > of statistical significance
> > To: corpora at uib.no
> >
> > I know that David Lee had problems with the representativeness of
> > the BNC, but I believe that Tony McEnery, at least, is on the list, so
> > he can maybe tell us more about why the BNC is representative, and of
> > what.
> > ...
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>