[Corpora-List] Quantitive Corpus Linguistics

Dom Widdows widdows at google.com
Fri Aug 22 06:03:05 CEST 2008


Dear All,

I certainly agree that studying the relevant philosophy has been an important part of many (if not most) successful scientific endeavours, though it can also mislead if applied in the wrong contexts (the same can be said of mathematics).

Peter Helias is not someone I'd come across before, and he's not the easiest to find out about online - I have started a stub Wikipedia article (at http://en.wikipedia.org/wiki/Peter_Helias), but his contribution to the theory of substance and accidence is still unclear to me. Christian scholars often trace this through Aquinas (important in the theory of transubstantiation - body and blood of Christ are substance, bread and wine are accidence), and perhaps through Augustine to Aristotle. (I know most of this through dinner conversations with my father, so I don't really know the references well.) A more pluralistic story might be to trace the influence of Aristotle through Averroes and al-Farabi, who certainly wrote some fascinating things on the way words would become reused, formally or informally, to refer to many different but related concepts - perhaps anticipating generative lexicon theory.

I'm surprised to hear the notion that "collocation is everything" coming through a voice in this tradition; I haven't yet found such arch-empiricist quotes from Helias himself (but need to find more corpus data here!). I think of this "data is everything, there is no need for a mind" attitude as associated with David Hume and the Scottish Enlightenment, sometimes described as a kind of reaction to Descartes' "reason is everything" (or at least "I am a thing that thinks", as contrasted with a thing that experiences and learns). Leibniz and Kant are both supposed to have tried to find different middle grounds between these extremes. (Here I could probably find good quotes, but it's getting late ... write to me if you want me to try and back this up with sources.)

There are a couple of themes behind this ramble, honest ...

The first is that every branch and period of science struggles over this learning vs. reasoning territory, and we are very much in the midst of this struggle in computational linguistics. If we can learn anything from the story of other sciences (even mechanics), corralling one side or the other into putting their tools away never leads to the full story.

Secondly, there is an Aristotelian theme throughout - Aristotle's influence isn't opposed to Plato's; rather, it emphasizes a framework for learning in a world that still has a lot of underlying form to it.

On 8/21/08, Mike Maxwell <maxwell at umiacs.umd.edu> wrote:
> J Washtell wrote:
> > I find it a bit optimistic (given my own intuitions of course. But I
> > should say that I do not find it beyond the realms of possibility)
> > that the evidence necessary to solve all of our linguistic and
> > (unavoidably?!) cognitive-linguistic ponderings is to be found in the
> > text (not in the brain, say, or in the extra-corporal context).

Hence I agree with this reservation - trying to find everything in the text alone would be like Hume trying to find everything in the data alone without any contribution from reasoning. (Please come out and correct me, Hume scholars, if I'm out of line here.)


> Not to mention that if you limit yourself to studying things that
> require large corpora, you rule out studying perhaps 99% of the
> languages in the world.

This I'd disagree with - you can learn things about the structure of language in general by considering available large corpora, and use this knowledge to try and enhance what you can do with small datasets. Linear B was a comparatively small corpus, but using knowledge of classical Greek, it could be deciphered. Perhaps this is a canned example since the languages are in a sense "the same" - but even for completely unrelated languages, a good linguist uses information learned about familiar languages to build expertise on language in general, and can then apply this expertise and technique to fresh languages with small amounts of data. It's only if corpus linguistics explicitly rules out generalization that a strictly empiricist approach leads to no cross-lingual extrapolation.

Best wishes, Dominic


