[Corpora-List] Reducing n-gram output

J Washtell lec3jrw at leeds.ac.uk
Wed Oct 29 11:49:30 CET 2008

Quoting Yannick Versley <versley at sfs.uni-tuebingen.de>:

> ...since data-driven approaches like this can easily
> lead onto the slippery slope to cargo-cult science where people blindly use
> nontrivial tool X to achieve a simple problem Y that actually has good
> solutions somewhere else (e.g, X=compression programs, Y=language modeling,
> where the speech community has been working for decades on n-gram- and
> syntax-based language models which also do a much better job at it).

I would whole-heartedly agree with Yannick that there is no sense in applying any method blindly. It is blindness that often holds us back. The sciences are full of very similar approaches to different tasks (and very different approaches to similar tasks), developed entirely independently in their respective domains, whose practitioners remain more or less oblivious to one another. Sometimes these are all but equivalent: the generalized mean and Minkowski distance; cosine similarity and Pearson's correlation. Sometimes they are just surprisingly similar: efforts in language modelling and compression being a case in point. Sometimes they are even in the same domain: Nick Nolte and Gary Busey.
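As an illustration of the "all but equivalent" pairs mentioned above, here is a minimal Python sketch. It is not taken from the original message: the function names and sample vectors are assumptions chosen for the example. It checks two standard identities: Pearson's correlation equals the cosine similarity of the mean-centred vectors, and the Minkowski distance of order p equals n^(1/p) times the generalized (power) mean of order p of the absolute coordinate differences.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centred(u):
    """Subtract the mean of u from each of its components."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]

def pearson(u, v):
    """Pearson's correlation, computed from covariance and standard deviations."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    sd_u = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sd_v = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (sd_u * sd_v)

def generalized_mean(xs, p):
    """Power mean of order p (p > 0) of non-negative values xs."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def minkowski(u, v, p):
    """Minkowski distance of order p (p > 0) between u and v."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

# Hypothetical sample data, purely for illustration.
u = [1.0, 2.0, 4.0, 7.0]
v = [2.0, 0.5, 5.0, 9.0]
p = 3

# Pearson's correlation vs. cosine similarity of the centred vectors.
print(pearson(u, v), cosine_similarity(centred(u), centred(v)))

# Minkowski distance vs. scaled generalized mean of absolute differences.
diffs = [abs(a - b) for a, b in zip(u, v)]
print(minkowski(u, v, p), len(u) ** (1.0 / p) * generalized_mean(diffs, p))
```

Each pair of printed values should agree up to floating-point rounding. Note that the cosine/Pearson equivalence holds only after centring, which is the sense in which the two are "all but" rather than exactly equivalent.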

I once viewed the move towards multidisciplinary research as some kind of misplaced scientific political correctness. However, my present explorations now include the ("surprisingly obvious once you consider it") application of a very simple sixty-year-old biogeographical method to language modelling. I am now more inclined to think of multidisciplinarity as the messiah of enlightenment (though I dare say I am due for a revision). My take on this advice therefore might be something like: Take X and Y. Try to ascertain their individual advantages, limitations and similarities with respect to the problem. If neither X nor Y is ideal, consider whether they suggest a third, better, approach Z. Check to see if anything like Z has already been discussed anywhere in the (wider) literature. If not, try it out. Or if it has, repeat the process with X, Y and now Z. Due to inconsistent nomenclature, and the general isolation of the disciplines in the literature, it makes for very heavy-going research, but I believe that the dividends are [more than] proportionately larger.

I would agree with Yannick that the slippery slope of which he warns is a real danger (one can find one or two people whizzing past on it in the literature). But I might suggest that it would be even less flattering to the science if we were to take a diametrically opposed stance. It would be remiss to imply that compression algorithms, for example, are only deserving of limited investigation in light of NLP's successes without them (and I would resolutely contend that they are non-trivial in comparison with language modelling). Like many established research areas, compression provides a set of broadly applicable tools and knowledge which are readily accessible for exploration and ripe for tearing apart and re-synthesizing. As long as there is a discerning intellect at the helm, this can only be a good thing.

Justin Washtell
University of Leeds
