[Corpora-List] Reducing n-gram output

J Washtell lec3jrw at leeds.ac.uk
Tue Oct 28 14:20:36 CET 2008


There are compression algorithms, such as LZW, that work on this principle. In theory you can "optimally" compress a text by recursively storing all strings as combinations of the most common substrings and so on, and using a minimal number of bits for the most common substrings (huffman encoding). In practice its a bit problematic because you don't know what the most common substrings are until you've examined the entire text, so efficient compression algorithms based on this principle tend to choose the substrings by which to represent a string based only on what they've seen so far, and settle for being sub-optimal in exchange for working in linear time and with limited working memory requirements.

A benefit of this general approach is that the method's predictive power, and hence the compression ratio, increases with the more text you store, as language tends to be re-used. In the extreme case, for example, you can encode multiple revisions of the same document incredibly efficiently, as an edited document is made up mostly of large chunks of the previous version which you have already stored.

There is code out there for LZW compression and its relatives, and you could quite easily implement your own, adapted towards the language-oriented tasks I imagine that you have in mind.

From a language modelling persepective, it would be preferable to go the whole-hog and ascertain frequencies of every n-gram from n=1 upwards, prior to compression. If you start at the character level, rather than the word level, then you get morphological analysis for free! You would end up with a nifty language model from which you could estimate the probability of a given (unseen) string, or identify morphologically remarkable words or unusual idiomatic phrases, by comparing their frequency of occurrence with the combined probabilities of their most probable substrings. It might be even more pertinent to sum the probabilities of *all* possible substring interpretations of a string, but to do that you are probably looking at a huge computational task.


Justin Washtell University of Leeds

Quoting Dahlmann Irina <aexid at nottingham.ac.uk>:

> Dear all,
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
> Many thanks,
> Irina Dahlmann
> PhD student
> School of English Studies
> University of Nottingham
> aexid at nottingham.ac.uk
> This message has been checked for viruses but the contents of an attachment
> may still contain software viruses, which could damage your computer system:
> you are advised to perform your own checks. Email communications with the
> University of Nottingham may be monitored as permitted by UK legislation.
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

More information about the Corpora mailing list