[Corpora-List] Reducing n-gram output

Adam Lopez alopez at inf.ed.ac.uk
Tue Oct 28 16:21:52 CET 2008



> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-
> gram 'at the end of the' as part of the 6-gram 'at the end of the
> day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).

Suffix trees, suffix arrays, and their relatives are compact data structures following exactly the intuition that smaller strings are substrings of larger ones. They represent all possible n-grams (without limit on n) of a text in space proportional to the length of the text and support efficient retrieval, counting, and other queries on substrings of the text; there is a vast literature on their various applications (and theory linking them to compressibility, etc.).

Whether they are useful for you depends on: 1) your application, and 2) whether your corpus fits in memory (the development of efficient external data structures with these properties is an active research area).

Adam



More information about the Corpora mailing list