[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Detmar Meurers dm at sfs.uni-tuebingen.de
Wed Feb 4 00:15:28 CET 2015

Hi Ramesh,

> My answer to Angus's question about 'what an adequate way of estimating
> language usage might be' is that i don't think we do have a way at the moment.
> I would therefore suggest, as a consequence, that editors/reviewers should
> certainly be wary of pushing researchers into providing /p/-values or any
> other statistic, as it encourages people to regard statistical measures
> as of higher value than discursive reports and findings.
> I know several junior researchers who were pressured into providing
> more statistical representations of their results than they
> themselves were happy with. I find a similar elevation of
> 'precision/recall' stats in Computational Linguistics. Both groups
> need to remember that some researchers in them (like me) are
> primarily interested in language studies, and an over-emphasis on
> stats tends to lead us away from this. I agree with Lou and Tony
> that the BNC designers and creators tried their best to collect a
> varied and reasonable corpus of texts and genres that were available
> at the time.
> I myself happily and frequently collect corpora on topics that
> interest me, or create corpora from text collections that i happen
> to notice have become available (eg the court documents relating to
> the Michael Brown shooting). All we can do is to give a rationale
> for the contents of any corpus, and a detailed description of them,
> and mould our linguistic descriptions accordingly, and posit
> generalisations that are a reasonable extrapolation from the data?

It all depends on what we want to model in our research, doesn't it?

Those interested in modeling the cognitive reality of human language use will want to select corpora representative of typical human experience - and one can ground that quite precisely using experiments such as the lexical decision tasks mentioned, or, e.g., by computing surprisal based on different corpora to see how well that fits eye-tracking measures such as the Dundee eye-tracking data (building on Demberg, Keller & Koller, 2013).
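
To make the surprisal idea concrete, here is a minimal sketch, not the method used in the work cited above. Surprisal of a word is -log2 of its probability under a language model estimated from a corpus. The sketch assumes a unigram model with add-alpha smoothing and two tiny made-up "corpora"; the function name unigram_surprisal and the example sentences are hypothetical. Serious work would use a much richer language model and real reading-time data.

```python
import math
from collections import Counter


def unigram_surprisal(corpus_tokens, alpha=1.0):
    """Return a function mapping a word to its surprisal in bits.

    Assumption: a unigram model with add-alpha smoothing, chosen only
    to keep the sketch short.
    """
    counts = Counter(corpus_tokens)
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen words
    total = len(corpus_tokens) + alpha * vocab_size

    def surprisal(word):
        probability = (counts[word] + alpha) / total
        return -math.log2(probability)

    return surprisal


# Two toy corpora standing in for different kinds of language experience
# (hypothetical data, for illustration only).
news = "the court ruled that the appeal was dismissed".split()
chat = "lol that was so funny the cat fell off".split()

s_news = unigram_surprisal(news)
s_chat = unigram_surprisal(chat)

for word in ["court", "lol", "the"]:
    print(word, round(s_news(word), 2), round(s_chat(word), 2))
```

The same word gets a different surprisal depending on which corpus the model was estimated from, and it is this per-corpus difference that one would then compare against reading measures.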

Others, interested in the performance of parsers for a particular application task, will want to report precision/recall on a corpus representative of that application domain (and ideally train and test cross-corpus using several independently collected corpora from that domain to avoid overfitting to idiosyncrasies of a particular corpus and obtain results that generalize to the targeted domain).
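
As a rough illustration of reporting precision/recall per corpus, here is a small sketch. It assumes parser output and gold annotation are both available as sets of (sentence_id, head, dependent) arcs; that representation, the function name precision_recall, and the two toy corpora are hypothetical choices made for the example, not any particular parser's format.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against gold items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical arcs for two independently collected corpora from one domain.
corpora = {
    "corpus_a": {
        "gold": {(1, 0, 2), (1, 2, 1), (1, 2, 3)},
        "predicted": {(1, 0, 2), (1, 2, 1), (1, 1, 3)},
    },
    "corpus_b": {
        "gold": {(1, 0, 1), (1, 1, 2), (1, 1, 3), (1, 3, 4)},
        "predicted": {(1, 0, 1), (1, 1, 2)},
    },
}

# Report each corpus separately so corpus-specific effects stay visible.
results = {
    name: precision_recall(data["predicted"], data["gold"])
    for name, data in corpora.items()
}
for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Keeping the scores separate per corpus, rather than pooling them, is what lets one see whether a result holds across the domain or only on the corpus the parser was trained on.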

Yet others, interested in, e.g., the characteristics of 19th-century compared to 18th-century poetry, will of course use other criteria for selecting which authors and poems to include - and claims made on the basis of that data should be made with reference to the data selection criteria used.

I guess we're reminding ourselves in this discussion that (unlike experiments using hand-designed, carefully controlled data to prove/disprove one specific hypothesis) corpora are designed to be reused for different purposes and by people other than the ones who compiled them - but "reusers" then need to (and need to be enabled to and be competent to) check whether the selection criteria used in compiling the corpus fit their research purpose. Striving towards bigger, "more broadly representative" corpora is not a solution in that respect - for certain research purposes, such as probing into Second Language Acquisition, smaller, task-specific corpora arguably are crucial to obtain interpretable evidence speaking to typical SLA research questions.

Best, Detmar

-- Prof. Dr. Detmar Meurers, Universität Tübingen http://purl.org/dm Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany
