Here's one example using longest common prefixes:

Yamamoto and Church, Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, Comp. Linguistics, 2000.
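To make the idea concrete, here is a minimal sketch of counting substring occurrences with a suffix array and its longest-common-prefix (LCP) array. It is not the algorithm from the paper, only an illustration of the underlying data structure; the function names are my own, and the construction is deliberately naive (fine for a toy string, far too slow for a real corpus).

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] is the length of the common prefix shared by the suffixes
    # at sa[k - 1] and sa[k]; lcp[0] is defined as 0.
    def common(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def count_occurrences(text, sa, lcp, pattern):
    # Suffixes that start with `pattern` are contiguous in the suffix array.
    # Find the first one (linear scan for clarity; binary search in practice),
    # then walk right while neighbours still share at least len(pattern) chars.
    for k, start in enumerate(sa):
        if text.startswith(pattern, start):
            break
    else:
        return 0
    hi = k
    while hi + 1 < len(sa) and lcp[hi + 1] >= len(pattern):
        hi += 1
    return hi - k + 1


text = "to be or not to be"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(count_occurrences(text, sa, lcp, "to be"))
```

The point is that every substring's frequency falls out of runs in the LCP array, so you never have to enumerate the substrings one at a time.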


Some years back I used this technique to help identify bilingual phrasal equivalents:

McNamee and Mayfield, Translation of Multiword Expressions Using Parallel Suffix Arrays, AMTA 2006.


An actual use of longest common substring is found in proper name variant matching (e.g., is "Mikhail Sergeyevich Gorbachev" coreferent with "Michail Gorbatchev"?).
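For illustration, a minimal sketch of longest common substring using the textbook dynamic-programming table, applied to the two name variants above. The function name is my own, and a real name matcher would presumably combine this with normalization and other evidence; this only shows the core computation.

```python
def longest_common_substring(a, b):
    # prev[j] holds the length of the longest common suffix of
    # a[:i-1] and b[:j]; curr[j] is the same for a[:i] and b[:j].
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    # Remember where the best match ends in `a`.
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


shared = longest_common_substring("Mikhail Sergeyevich Gorbachev",
                                  "Michail Gorbatchev")
print(repr(shared))
```

On this pair the longest shared span is " Gorba" (with the leading space), which hints at why LC substring alone is a weak signal and is usually only one feature among several.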



LCS is also widely used to identify spans of text that are duplicates or near duplicates; similar methods can also be applied to the problems of plagiarism detection and authorship attribution.

> LCS algorithms are heavily used in bioinformatics to analyze DNA sequences
> How are they used in corpora research?
