> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-
> gram 'at the end of the' as part of the 6-gram 'at the end of the
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
Suffix trees, suffix arrays, and their relatives are compact data structures following exactly the intuition that smaller strings are substrings of larger ones. They represent all possible n-grams (without limit on n) of a text in space proportional to the length of the text and support efficient retrieval, counting, and other queries on substrings of the text; there is a vast literature on their various applications (and theory linking them to compressibility, etc.).
Whether they are useful for you depends on: 1) your application, and 2) whether your corpus fits in memory (the development of efficient external data structures with these properties is an active research area).