[Corpora-List] An ignorant question concerning the basics of statistical significance > REPRESENTATIVENESS

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Wed Feb 4 12:06:29 CET 2015

A couple of points that have arisen from off-list conversations:

#1 I have mooted from time to time the possibility of 'life corpora', collections of as much speech and writing by the same person as possible... for longitudinal development studies... e.g. many universities now have electronic submission, so we have some limited aspects covered already... however, we are all so varied in our language production and consumption that using 'life corpora' as a step towards 'representativeness' merely shifts the problem from language usage to people's personal behaviours and circumstantial environments?

#2 I am very keen to encourage every innovation in corpus collection, because I believe in the basic value of adding quantitative to qualitative research. However skewed our datasets, the more sets we construct, the clearer the sources of skewage become?

#3 'modelling' is not how I view my research (although it might be what I actually do)... I just enjoy perceiving patterns in the language data I look at, and imagining the processes that might give rise to such patterns in such snapshots...

#4 I am all in favour of people constructing corpora in a variety of ways, for different purposes, and I think the main intellectual exercise is in making statements that are as tightly appropriate to the corpus contents as possible, in extrapolating only along axes in the data that have been clearly identified, and in being extremely wary of the distance of the extrapolation from the description...

#5 so: accurate metadata, rigorous description, and cautious extrapolation are my three aims in analysing any dataset... :)

apologies for any repetitions...

best ramesh

