[Corpora-List] Reducing n-gram output

Detmar Meurers dm at ling.ohio-state.edu
Tue Oct 28 21:17:33 CET 2008


Dear Irina,

I was wondering whether anybody is aware of ideas and/or automated
processes to reduce n-gram output by solving the common problem that
shorter n-grams can be fragments of larger structures (e.g. the 5-gram
'at the end of the' as part of the 6-gram 'at the end of the day').

On http://decca.osu.edu you can find the Python code that Markus Dickinson, Adriane Boyd and I used for detecting errors in corpus annotation. It implements a version of the a priori algorithm to efficiently compute the longest recurring n-grams in a corpus. There are also some papers there discussing the algorithm (the EACL'03 paper is probably the best one to read, since the annotations are irrelevant for your purposes).
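To give a rough idea of how this works, here is a minimal sketch (not the code from the site above). It relies on the a priori observation that an (n+1)-gram can only recur if the n-gram it starts with recurs, so it only ever extends n-grams that are already known to recur. The function name longest_recurring_ngrams, the min_count parameter, and the rule that an n-gram counts as a fragment whenever some longer recurring n-gram contains it are all assumptions made for illustration.

```python
from collections import defaultdict


def longest_recurring_ngrams(tokens, min_count=2):
    """Return {ngram: count} for recurring n-grams that are not part of
    a longer recurring n-gram (illustrative sketch, names are made up)."""
    # Level 1: record the start positions of every token.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[(tok,)].append(i)
    current = {g: p for g, p in positions.items() if len(p) >= min_count}

    maximal = {}
    while current:
        # Extend each recurring n-gram by one token to the right.
        # Non-recurring n-grams are never extended (a priori pruning).
        longer = defaultdict(list)
        for gram, starts in current.items():
            n = len(gram)
            for s in starts:
                if s + n < len(tokens):
                    longer[gram + (tokens[s + n],)].append(s)
        longer = {g: p for g, p in longer.items() if len(p) >= min_count}

        # An n-gram is a fragment if it is the prefix or the suffix
        # of some recurring (n+1)-gram.
        covered = set()
        for g in longer:
            covered.add(g[:-1])
            covered.add(g[1:])

        for gram, starts in current.items():
            if gram not in covered:
                maximal[gram] = len(starts)
        current = longer
    return maximal


text = "at the end of the day we met at the end of the day"
print(longest_recurring_ngrams(text.split()))
# prints {('at', 'the', 'end', 'of', 'the', 'day'): 2}
```

In this sketch 'at the end of the' is dropped because the recurring 6-gram contains it. A softer variant would drop a shorter n-gram only when the longer one has the same frequency, so that fragments which also occur in other contexts are kept.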

Best, Detmar

--
Prof. Dr. Detmar Meurers, Universität Tübingen
http://purl.org/dm
Seminar für Sprachwissenschaft, Wilhelmstr. 19, 72074 Tübingen, Germany


