Based on statistics like frequency, mutual information, or the distribution of the n-grams, you could discard the smaller-n n-gram if:

* its frequency is equal to that of the larger-n n-gram (i.e., all occurrences of the smaller n-gram are actually part of the larger n-gram in the corpus)
* the frequency of the larger n-gram is at least (some value) times its frequency (e.g., at least 80% of the smaller n-gram's occurrences are part of the larger n-gram)
* the mutual information for the larger n-gram is greater than that for the smaller n-gram, plus some adjustment

(A rough sketch of the frequency-based criteria follows below.)

I think it really makes sense to (a) go the whole way and approximate what you want as well as you reasonably can, and (b) explicitly reason about what you are approximating, since data-driven approaches like this can easily lead onto the slippery slope to cargo-cult science, where people blindly use a nontrivial tool X for a simple problem Y that actually has good solutions elsewhere (e.g., X = compression programs, Y = language modeling, where the speech community has been working for decades on n-gram- and syntax-based language models that also do a much better job of it).
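To make the frequency-based criteria concrete, here is a minimal sketch in Python. The corpus, the 0.8 threshold, and all function names are illustrative assumptions on my part (it considers, for each smaller n-gram, the single most frequent larger n-gram that contains it), not code from any particular toolkit:

from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune_subsumed(tokens, n, threshold=0.8):
    """Return the (n-1)-grams that are NOT subsumed by a larger n-gram.

    An (n-1)-gram is discarded if at least `threshold` of its occurrences
    fall inside the single most frequent n-gram containing it; with
    threshold=1.0 this is the 'frequency is equal' criterion above."""
    small = ngram_counts(tokens, n - 1)
    large = ngram_counts(tokens, n)

    # For each (n-1)-gram, record the frequency of the best-covering
    # larger n-gram: an n-gram (w1..wn) contains two (n-1)-grams,
    # its prefix (w1..w_{n-1}) and its suffix (w2..wn).
    best_cover = Counter()
    for gram, freq in large.items():
        for sub in (gram[:-1], gram[1:]):
            if freq > best_cover[sub]:
                best_cover[sub] = freq

    return {g: f for g, f in small.items()
            if best_cover[g] < threshold * f}

if __name__ == "__main__":
    toks = "the quick brown fox jumps over the lazy dog , the quick brown cat".split()
    # bigrams like ('the', 'quick') are dropped because all their
    # occurrences are covered by the trigram ('the', 'quick', 'brown')
    print(prune_subsumed(toks, 3))

The mutual-information criterion would work the same way, just comparing association scores instead of raw counts.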
You might want to look at the research of Douglas Biber, who uses n-grams with some additional information and calls them "lexical bundles"; see, e.g., Biber/Conrad/Cortes, "If you look at ...: Lexical Bundles in University Teaching and Textbooks", http://applij.oxfordjournals.org/cgi/content/abstract/25/3/371
Best wishes,

--
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352