[Corpora-List] Reducing n-gram output

Yannick Versley versley at sfs.uni-tuebingen.de
Tue Oct 28 15:19:20 CET 2008



> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).

The technical problem here (checking whether there are n-grams with larger n that contain a given n-gram as a substring) is not really complicated, so the essential question is what you want to achieve with it. The answer to that would give you an idea of the criteria to use for discarding smaller-n n-grams.
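
As a minimal sketch of that containment check, assuming n-grams are stored as tuples of tokens in a frequency table (the function names and the toy sentence are made up for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-token window of `tokens` as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def containing_ngrams(shorter, counts):
    """Return the strictly longer n-grams in `counts` that contain `shorter`."""
    return [g for g in counts if len(g) > len(shorter) and contains(g, shorter)]

# Toy corpus (invented): count all 5-grams and 6-grams.
tokens = "at the end of the day we met at the end of the day".split()
counts = Counter(ngrams(tokens, 5) + ngrams(tokens, 6))

five = ("at", "the", "end", "of", "the")
supers = containing_ngrams(five, counts)
for g in supers:
    print(" ".join(g), counts[g])
```

This scans the whole table for each query, which is fine for small lists; for large n-gram lists an index from each shorter n-gram to its extensions would be the obvious refinement.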

Based on statistics like frequency, mutual information, or the distribution of the n-grams, you could discard the smaller-n n-gram if:
 * its frequency is equal to that of the larger-n n-gram (i.e., all occurrences of the smaller n-gram are actually part of the larger n-gram in the corpus)
 * the frequency of the larger n-gram is at least (some value) * its frequency (e.g., with a value of 0.8, at least 80% of the smaller n-gram's occurrences are part of the larger n-gram)
 * the mutual information for the larger n-gram is greater than that for the smaller n-gram plus some adjustment

I think it really makes sense to (a) go the whole way and approximate what you want as well as you reasonably can and (b) explicitly reason about what you are approximating with it. Data-driven approaches like this can easily lead onto the slippery slope to cargo-cult science, where people blindly use nontrivial tool X to solve a simple problem Y that actually has good solutions somewhere else (e.g., X = compression programs, Y = language modeling, where the speech community has been working for decades on n-gram- and syntax-based language models, which also do a much better job at it).
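
The two frequency-based criteria above could be sketched roughly as follows. This is only an illustration under stated assumptions: n-grams are tuples of tokens, `counts` maps each n-gram to its corpus frequency, the name `should_discard` and the toy counts are invented, and the parameter `min_share` stands in for "(some value)". Setting `min_share=1.0` gives the equal-frequency criterion. The mutual-information criterion is left out because it depends on which definition of mutual information for n-grams you adopt.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def should_discard(shorter, counts, min_share=0.8):
    """Discard `shorter` if some longer n-gram containing it has a
    frequency of at least `min_share` times the frequency of `shorter`.

    Simplification: a longer n-gram that contains `shorter` more than
    once is still counted as covering one occurrence per token of itself.
    """
    f_short = counts.get(shorter, 0)
    if f_short == 0:
        return False
    for g, f_long in counts.items():
        if len(g) > len(shorter) and contains(g, shorter):
            if f_long >= min_share * f_short:
                return True
    return False

# Invented toy frequencies, not taken from any real corpus.
counts = {
    ("the", "end", "of"): 40,
    ("at", "the", "end", "of", "the"): 10,
    ("at", "the", "end", "of", "the", "day"): 9,
}

kept = [g for g in counts if not should_discard(g, counts)]
for g in kept:
    print(" ".join(g), counts[g])
```

With these toy numbers the 5-gram is dropped at `min_share=0.8` (9 of its 10 occurrences sit inside the 6-gram) but kept at `min_share=1.0`, which shows how much the result depends on the threshold you choose.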

You might want to look at the research of Douglas Biber, who uses n-grams with some additional information and calls them "lexical bundles", e.g.:

Biber/Conrad/Cortes, "If you look at ...: Lexical Bundles in University Teaching and Textbooks"
http://applij.oxfordjournals.org/cgi/content/abstract/25/3/371

Best wishes,
--
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352


