[Corpora-List] Reducing n-gram output

Alexandre Rafalovitch arafalov at gmail.com
Sat Nov 1 02:00:44 CET 2008


One of the interesting papers on suffix arrays is by Chunyu Kit: "The Virtual Corpus approach to deriving n-gram statistics from large scale corpora" http://personal.cityu.edu.hk/~ctckit/papers/vc.pdf

I have some (scary) Java code based on those concepts that can do statistical analysis on n-grams with n above 140, and also looks at the corresponding (n-1)-grams and (n+1)-grams.
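For anyone curious what the virtual-corpus idea looks like in practice, here is a minimal Java sketch (my toy illustration, not the code mentioned above, and the class/method names are made up): instead of materializing n-grams, sort an array of token start positions into a suffix array; all occurrences of any n-gram then lie in one contiguous run, found by two binary searches.

```java
import java.util.*;

// Toy "virtual corpus": a suffix array over token positions.
// Counting an n-gram costs O(n log N) token comparisons; no
// n-gram is ever stored explicitly, whatever n is.
public class VirtualCorpus {
    private final String[] tokens;
    private final Integer[] sa; // token positions sorted by the suffix starting there

    public VirtualCorpus(String[] tokens) {
        this.tokens = tokens;
        sa = new Integer[tokens.length];
        for (int i = 0; i < tokens.length; i++) sa[i] = i;
        Arrays.sort(sa, this::compareSuffixes);
    }

    private int compareSuffixes(int a, int b) {
        while (a < tokens.length && b < tokens.length) {
            int c = tokens[a].compareTo(tokens[b]);
            if (c != 0) return c;
            a++; b++;
        }
        // a prefix sorts before the suffix that extends it
        return (tokens.length - a) - (tokens.length - b);
    }

    // Compare the suffix at position p against the n-gram, token by token;
    // 0 means the suffix starts with the n-gram.
    private int compareToNgram(int p, String[] ngram) {
        for (String w : ngram) {
            if (p == tokens.length) return -1; // suffix ran out: smaller
            int c = tokens[p++].compareTo(w);
            if (c != 0) return c;
        }
        return 0;
    }

    // Two binary searches bracket the run of suffixes starting with the n-gram.
    public int count(String... ngram) {
        return bound(ngram, true) - bound(ngram, false);
    }

    private int bound(String[] ngram, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareToNgram(sa[mid], ngram);
            if (c < 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        String[] text = "at the end of the day at the end of the week".split(" ");
        VirtualCorpus vc = new VirtualCorpus(text);
        System.out.println(vc.count("at", "the", "end", "of", "the")); // 2
        System.out.println(vc.count("the"));                           // 4
        System.out.println(vc.count("end", "of", "the", "week"));      // 1
    }
}
```

The naive comparison-based sort makes this O(N^2 log N) in the worst case; a serious implementation would use a linear-time suffix array construction, but the query side is the same.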

If that's of interest, I would be happy to discuss this further by direct email (so as not to bother the list).

Regards,

Alex.

Personal blog: http://blog.outerthoughts.com/ Research group: http://www.clt.mq.edu.au/Research/

On Tue, Oct 28, 2008 at 11:21 AM, Adam Lopez <alopez at inf.ed.ac.uk> wrote:
>
>> I was wondering whether anybody is aware of ideas and/or automated
>> processes to reduce n-gram output by solving the common problem that
>> shorter n-grams can be fragments of larger structures (e.g. the 5-
>> gram 'at the end of the' as part of the 6-gram 'at the end of the
>> day')
>>
>> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
>
> Suffix trees, suffix arrays, and their relatives are compact data
> structures following exactly the intuition that smaller strings are
> substrings of larger ones. They represent all possible n-grams
> (without limit on n) of a text in space proportional to the length of
> the text and support efficient retrieval, counting, and other queries
> on substrings of the text; there is a vast literature on their various
> applications (and theory linking them to compressibility, etc.).
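To illustrate the counting Adam describes: in a token-level suffix array, all occurrences of a given n-gram sit in adjacent entries, so one linear scan over the array enumerates every distinct n-gram of length n together with its count. A small Java sketch of that idea (names here are illustrative, not any particular library's API):

```java
import java.util.*;

// One pass over a token-level suffix array: each run of suffixes
// sharing the same first n tokens is one distinct n-gram, and the
// run length is its corpus frequency.
public class NgramScan {
    // Token positions sorted by the suffix starting there.
    static Integer[] suffixArray(String[] tokens) {
        Integer[] sa = new Integer[tokens.length];
        for (int i = 0; i < tokens.length; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> {
            int x = a, y = b;
            while (x < tokens.length && y < tokens.length) {
                int c = tokens[x].compareTo(tokens[y]);
                if (c != 0) return c;
                x++; y++;
            }
            return (tokens.length - x) - (tokens.length - y);
        });
        return sa;
    }

    static boolean sameNgram(String[] t, int p, int q, int n) {
        for (int k = 0; k < n; k++)
            if (!t[p + k].equals(t[q + k])) return false;
        return true;
    }

    public static Map<String, Integer> ngramCounts(String[] tokens, int n) {
        // Keep only positions long enough to carry a full n-gram;
        // sorted order (hence adjacency) is preserved by filtering.
        List<Integer> starts = new ArrayList<>();
        for (int p : suffixArray(tokens))
            if (p + n <= tokens.length) starts.add(p);

        Map<String, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < starts.size()) {
            int j = i;
            while (j < starts.size() && sameNgram(tokens, starts.get(i), starts.get(j), n)) j++;
            int p = starts.get(i);
            counts.put(String.join(" ", Arrays.copyOfRange(tokens, p, p + n)), j - i);
            i = j;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] text = "to be or not to be".split(" ");
        System.out.println(ngramCounts(text, 2));
        // {be or=1, not to=1, or not=1, to be=2}
    }
}
```

Because counts for n-grams of every length come from the same structure, comparing an n-gram's count against the counts of its (n+1)-gram extensions is cheap, which is exactly what the fragment-collapsing question above needs.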


