[Corpora-List] Broader linguistic resources

John F Sowa sowa at bestweb.net
Tue Feb 12 17:05:57 CET 2013

On 2/12/2013 8:55 AM, Dominic P Rout wrote:
> What are some useful, broad and accessible overview books
> about language (specifically in English) that might be useful
> to a student of NLP wishing to broaden their horizons?

That question reminds me of a thread from last week (Feb 5). The subject line was "New techniques in text processing":

Amac Herdagdelen asked:
> Is there anything new/fun that jumps to mind that I should read up on?
> ... What new things do we have/know to offer other fields?

Phil Gooch replied:
> If you're interested in extracting narrative event chains, then this
> might be worth looking at
> http://malt.ml.cmu.edu/mw/index.php/Chambers_and_Jurafsky,_Unsupervised_Learning_of_Narrative_Event_Chains,_ACL_2008
> Also, application of deep learning techniques might be of interest
> http://deeplearning.net/

Adam Kilgarriff replied:
> as well as tools you can trust, you need data you can trust.
> Techniques I describe in Getting to know your corpus
> http://trac.sketchengine.co.uk/attachment/wiki/AK/Papers/Kilgarriff_TSD2012.pdf?format=raw
> are designed to help researchers find the characteristics,
> quirks and biases of their dataset
> (video version http://www.youtube.com/watch?v=0XvWh6YqgkU)

An excerpt from Adam's paper:
> We show, with examples, how keyword lists (of one corpus vs. another)
> are a direct, practical and fascinating way to explore the characteristics
> of corpora, and of text types. Our method is to classify the top one hundred
> keywords of corpus1 vs. corpus2, and corpus2 vs. corpus1. This promptly reveals
> a range of contrasts between all the pairs of corpora we apply it to. We also
> present improved maths for keywords, and briefly discuss quantitative comparisons
> between corpora. All the methods discussed (and almost all of the corpora)
> are available in the Sketch Engine, a leading corpus query tool.

An excerpt from the Chambers & Jurafsky paper:
> Hand-coded scripts were used in the 1970-80s as knowledge backbones
> that enabled inference and other NLP tasks requiring deep semantic
> knowledge. We propose unsupervised induction of similar schemata
> called narrative event chains from raw newswire text.

Adam's paper describes important methods for analyzing corpora. They belong in the toolkit of anyone who processes large volumes of NL texts.
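
To make the keyword idea concrete, here is a minimal sketch of comparing corpus1 vs. corpus2 in both directions. It assumes a smoothed ratio of per-million frequencies as the keyword score; the function names, the whitespace tokenization, and the smoothing constant of 100 are my own illustrative choices, not necessarily the exact maths in Adam's paper or the Sketch Engine implementation.

```python
from collections import Counter


def keyword_scores(tokens_focus, tokens_ref, smoothing=100.0):
    """Score each word of the focus corpus against a reference corpus.

    Assumed score: (focus freq per million + k) / (reference freq per million + k),
    where k is a smoothing constant (hypothetical default: 100).
    """
    if not tokens_focus or not tokens_ref:
        raise ValueError("both corpora must contain at least one token")
    counts_focus = Counter(tokens_focus)
    counts_ref = Counter(tokens_ref)
    scores = {}
    for word, count in counts_focus.items():
        fpm_focus = count * 1_000_000 / len(tokens_focus)
        # Counter returns 0 for words absent from the reference corpus.
        fpm_ref = counts_ref[word] * 1_000_000 / len(tokens_ref)
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return scores


def top_keywords(tokens_focus, tokens_ref, n=100, smoothing=100.0):
    """Return the n highest-scoring words, ties broken alphabetically."""
    scores = keyword_scores(tokens_focus, tokens_ref, smoothing)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Toy data for illustration only; real corpora would be far larger.
corpus1 = "the cat sat on the mat the cat purred".split()
corpus2 = "the dog sat on the log the dog barked".split()
print(top_keywords(corpus1, corpus2, n=3))  # corpus1 vs. corpus2
print(top_keywords(corpus2, corpus1, n=3))  # corpus2 vs. corpus1
```

Words shared equally by both corpora (such as "the") score close to 1 and drop out of the list, so what remains is what distinguishes each corpus from the other. Classifying those lists by hand is the step the paper is really about.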

But the paper by Chambers & Jurafsky shows how issues that were popular 30 years ago can be revived as "cutting edge" research today. The important difference is that scripts that once had to be coded by hand can now be induced automatically from raw text by new methods of unsupervised learning.

For an example of a narrative structure by Chambers & Jurafsky, see Figure 6 of http://acl.eldoc.ub.rug.nl/mirror/P/P08/P08-1090.pdf

For structures in kidnap, bombing, attack, and arson, see http://www.stanford.edu/~jurafsky/acl2011-chambers-templates.pdf

The moral of this story is that research directions are heavily influenced by available technology. With new technology, research questions from decades ago (or even millennia ago) can be revived and addressed with new methods.

The implication for education is that research techniques can become obsolete, but fundamental questions never become obsolete. Sometimes the most fruitful research can be inspired by old questions that were abandoned because the available technology was inadequate.

John Sowa
