[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Chris Brew christopher.brew at gmail.com
Tue Feb 3 19:09:42 CET 2015

Mathematically, the challenge is that word distributions are heavy-tailed, which means that no matter how many tokens you have in your corpus, there will be a lot of words that you miss entirely. That is a fact of life. The problem is even worse if the features you are looking for depend on several different words occurring together. If you want a single corpus that will answer all your present and future questions, you are out of luck. And you should probably know better (as many of you clearly do).
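To make the heavy-tail point concrete, here is a minimal simulation sketch. It is not a model of any real corpus: the closed vocabulary, the Zipf-like weights (probability proportional to 1/rank) and the function name types_seen_at are all assumptions chosen for illustration. It draws tokens from that distribution and counts how many distinct word types have turned up after the first n tokens.

```python
import random
from itertools import accumulate


def types_seen_at(sample_sizes, vocab_size=50_000, exponent=1.0, seed=0):
    """Return {n: number of distinct word types seen in the first n tokens}.

    Assumption: a closed vocabulary of `vocab_size` types whose probabilities
    are proportional to 1 / rank**exponent (a Zipf-like toy distribution).
    """
    rng = random.Random(seed)
    # Cumulative weights let random.choices sample by bisection.
    cum_weights = list(
        accumulate(1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1))
    )
    # Draw one token stream and inspect its prefixes, so larger samples
    # always contain the smaller ones.
    tokens = rng.choices(range(vocab_size), cum_weights=cum_weights,
                         k=max(sample_sizes))
    checkpoints = set(sample_sizes)
    seen = set()
    result = {}
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i in checkpoints:
            result[i] = len(seen)
    return result


if __name__ == "__main__":
    vocab = 50_000
    for n, k in sorted(types_seen_at([1_000, 10_000, 100_000], vocab).items()):
        print(f"{n:>7} tokens: {k:>6} types seen, {vocab - k:>6} never observed")
```

In a toy like this the vocabulary is finite, so a large enough sample would eventually see everything. In real language the set of types keeps growing with the sample, which is why the situation is worse in practice than the simulation suggests.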

What you can do, and what the designers of the Brown Corpus did very well (I think I first learned this from your book, Tony), is to spell out precisely what the sample frame for the corpus was. If you want to replicate the Brown Corpus for a different language, time and place, you have a clear task. Pick a library similar in size to the holdings of the Brown library and the Providence Athenaeum, then work out good methods for identifying the 15 text categories used in the Brown Corpus (challenge: what is your culture's equivalent of the "western" part of "adventure and western fiction"? It is whatever plays the role that cowboy stories did in the Brown library in 1961). Finally, sample according to the very well specified rules for getting 500 chunks of roughly 2,000 words each out of your library in the Brown-mandated proportions. Then do some comparison or other, and prove something about your new language or culture.
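The "sample in fixed proportions from a stated frame" step can be sketched as below. This is only an illustration under stated assumptions, not the Brown Corpus procedure itself: the library structure (category name mapped to a list of tokenised texts), the proportions, and the helper names allocate, draw_chunk and sample_corpus are all hypothetical. Texts are picked with replacement, which a real design would probably want to avoid.

```python
import random


def allocate(proportions, total):
    """Split `total` samples across categories by largest-remainder rounding."""
    weight_sum = sum(proportions.values())
    exact = {cat: total * w / weight_sum for cat, w in proportions.items()}
    counts = {cat: int(x) for cat, x in exact.items()}
    leftover = total - sum(counts.values())
    # Give the remaining samples to the categories with the largest
    # fractional parts, so the counts add up to `total` exactly.
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - counts[cat],
                          reverse=True)
    for cat in by_remainder[:leftover]:
        counts[cat] += 1
    return counts


def draw_chunk(tokens, chunk_len, rng):
    """Return a contiguous run of up to `chunk_len` tokens from a random offset."""
    if len(tokens) <= chunk_len:
        return list(tokens)
    start = rng.randrange(len(tokens) - chunk_len + 1)
    return tokens[start:start + chunk_len]


def sample_corpus(library, proportions, total, chunk_len, seed=0):
    """Draw `total` chunks from `library`, stratified by category.

    `library` maps each category name to a list of texts, where each text
    is a list of tokens (a hypothetical structure for this sketch).
    """
    rng = random.Random(seed)
    corpus = []
    for category, n_samples in allocate(proportions, total).items():
        for _ in range(n_samples):
            text = rng.choice(library[category])  # with replacement
            corpus.append((category, draw_chunk(text, chunk_len, rng)))
    return corpus
```

The point of writing it down this way is that everything a replicator needs is an explicit input: the frame (library), the category proportions, the number of samples and the chunk length. The hard part, deciding which texts belong in which category, is exactly what the code cannot do for you.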

What does this show? What does this represent? Damned if I know, but you will have struck a blow for scientific replicability. Did you think of starting with a hypothesis?

Well, OK, that's unfair, but things do get easier if you plan your corpus collection with particular, limited tasks in mind. The BNC is based on the (in my opinion correct) assumption that a corpus collected according to rough and ready sampling from what was available would be lifechangingly useful for very many researchers. My only sadness is that it has proven so hard to get really wide distribution of the audio recordings that underlie the spoken part. That has now been done (http://www.phon.ox.ac.uk/AudioBNC) and is excessively cool. But my goodness, it would have been so valuable decades ago. Kudos to John Coleman and team.

On 3 February 2015 at 16:57, Mcenery, Tony <a.mcenery at lancaster.ac.uk> wrote:

> Angus, as a P.S., a book I rather like, with a nice chapter on corpus
> construction in it by Bauer and Aarts is:
> Bauer M & G Gaskell (2000) Qualitative researching with text, image and
> sound: a practical handbook. London, Sage
> Worth a look. Best,
> Tony
> ------------------------------
> *From:* corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of
> Angus Grieve-Smith [grvsmth at panix.com]
> *Sent:* 03 February 2015 16:09
> *To:* corpora at uib.no
> *Subject:* Re: [Corpora-List] An ignorant question concerning the basics
> of statistical significance > REPRESENTATIVENESS
> That begs the question of what an adequate way of estimating
> language usage might be. Is there a way to do it at all?
> If not, we should be pushing back harder on editors and reviewers who
> want to see *p*-values.
> On 2/3/2015 10:52 AM, Krishnamurthy, Ramesh wrote:
> Hi Angus
> As we have no adequate way of estimating language usage,
> and corpora are samples of language usage,
> is there any point in discussing 'representativeness' again?
> Or has there been an advance in estimating language usage
> in the past 30 years that I am unaware of?
> best
> ramesh
> --------
> Date: Mon, 02 Feb 2015 22:40:04 -0500
> From: Angus Grieve-Smith <grvsmth at panix.com>
> Subject: Re: [Corpora-List] An ignorant question concerning the basics
> of statistical significance
> To: corpora at uib.no
> I know that David Lee had problems with the representativeness of
> the BNC, but I believe that Tony McEnery, at least, is on the list, so
> he can maybe tell us more about why the BNC is representative, and of what.
> ...
> --
> -Angus B. Grieve-Smith
> grvsmth at panix.com