[Corpora-List] Reducing n-gram output

Damir C'avar dcavar at indiana.edu
Tue Oct 28 23:56:53 CET 2008

Dahlmann Irina wrote:
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
The c-gram approach just gives you a view of the larger n-grams that contain a smaller n-gram of your choice. The question is always what your application or general idea is.
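
To make the fragment problem from the question concrete, here is a minimal Python sketch. It is not Paul Rayson's c-gram implementation; it only assumes a simple frequency-based rule: drop an n-gram if some (n+1)-gram contains it and has exactly the same count. The function names (count_ngrams, supergrams, drop_subsumed) and the toy token list are made up for illustration.

```python
from collections import Counter


def count_ngrams(tokens, n_min, n_max):
    """Count all contiguous n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous subsequence of `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter
               for i in range(len(longer) - k + 1))


def supergrams(target, counts):
    """Collapsed view: all longer n-grams that contain `target`."""
    return {g: c for g, c in counts.items()
            if len(g) > len(target) and contains(g, target)}


def drop_subsumed(counts):
    """Assumed rule: remove an n-gram when an (n+1)-gram containing it
    has the same frequency, i.e. the shorter one never occurs alone."""
    kept = {}
    for gram, c in counts.items():
        subsumed = any(len(g) == len(gram) + 1 and c2 == c and contains(g, gram)
                       for g, c2 in counts.items())
        if not subsumed:
            kept[gram] = c
    return kept


# Hypothetical toy corpus
tokens = "at the end of the day we met at the end of the day".split()
counts = count_ngrams(tokens, 5, 6)
reduced = drop_subsumed(counts)
```

In this toy corpus the 5-gram 'at the end of the' occurs twice and always inside 'at the end of the day', so it is dropped. If the same 5-gram also occurred before another word (say 'week'), its count would exceed that of any single 6-gram and it would be kept. The pairwise comparison is quadratic in the number of n-gram types, so this is only meant for small lists.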

IMHO, such a view (plus statistical information and maybe some system for introducing symbols) is exploited in many grammar induction systems that use, e.g., alignment: different strings occurring in the same context, or one context with different strings occurring in it. This reminds me of the notion of substitutability in the structuralist tradition (e.g. Zellig Harris) and of Alignment-Based Learning (e.g. van Zaanen), and to some extent also of the mentioned work on morphology induction (e.g. Goldsmith).
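
As a rough illustration of "different strings occurring in the same context", the following Python sketch groups short token sequences by their immediate left and right neighbour. This is a toy under stated assumptions, not van Zaanen's Alignment-Based Learning or any other published system: the context is assumed to be exactly one token on each side, fillers are limited to max_len tokens, and the function name and example sentences are invented.

```python
from collections import defaultdict


def substitution_classes(sentences, max_len=2):
    """Map each (left, right) context to the set of token sequences
    seen between them; keep only contexts with more than one filler."""
    slots = defaultdict(set)
    for sent in sentences:
        # Boundary markers so sentence-initial/final fillers get a context
        padded = ["<s>"] + sent.split() + ["</s>"]
        for i in range(1, len(padded) - 1):
            for k in range(1, max_len + 1):
                j = i + k
                if j > len(padded) - 1:
                    break
                slots[(padded[i - 1], padded[j])].add(tuple(padded[i:j]))
    return {ctx: fillers for ctx, fillers in slots.items()
            if len(fillers) > 1}


# Hypothetical example sentences
sentences = [
    "she reads the book",
    "she reads a book",
    "she reads the old book",
]
classes = substitution_classes(sentences)
```

Here the context ('reads', 'book') collects the fillers 'the', 'a' and 'the old', which is the kind of substitutable set that could then be replaced by an introduced symbol.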

For pure corpus analysis and visualization of n-gram relations, Paul Rayson's c-grams might be the only relevant reference. Multigrams (used in some CL tasks, e.g. LID) might be related to this, at least from an applied perspective.


