[Corpora-List] longest common subsequenceS algorithms in corpora research ...

Paul McNamee paul.mcnamee at jhuapl.edu
Tue Feb 19 17:38:42 CET 2013

Here's one example using longest common prefixes:

  Yamamoto and Church, Using Suffix Arrays to Compute Term Frequency
  and Document Frequency for All Substrings in a Corpus,
  Comp. Linguistics, 2000.
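
For readers unfamiliar with the data structures involved, here is a minimal Python sketch of a suffix array, its longest-common-prefix (LCP) array, and a term frequency lookup. It is a character-level toy with a naive construction, not the algorithm from the paper; the function names (suffix_array, lcp_array, term_frequency) are made up for illustration.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a toy example; real corpora need a faster construction.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] = length of the longest common prefix shared by the suffixes
    # at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def term_frequency(text, sa, pattern):
    # All suffixes that start with `pattern` are adjacent in the suffix
    # array, so two binary searches give the number of occurrences.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


text = "banana"
sa = suffix_array(text)
print(sa, lcp_array(text, sa), term_frequency(text, sa, "ana"))
```

A large LCP value between neighbouring suffixes signals a repeated substring, which is what makes frequency counts for all substrings tractable.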


Some years back I used this technique to help identify bilingual phrasal equivalents:

  McNamee and Mayfield, Translation of Multiword Expressions Using Parallel
  Suffix Arrays, AMTA 2006.


An actual use of longest common substring is found in proper name variant matching (e.g., is "Mikhail Sergeyevich Gorbachev" coreferent with "Michail Gorbatchev"?).
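
As a rough illustration of the name-matching idea, here is a minimal Python sketch of longest common substring using the standard dynamic-programming recurrence. The scoring function (substring_similarity) and its normalisation by the shorter string are assumptions made for the example, not a description of any particular system.

```python
def longest_common_substring(a, b):
    # prev[j] holds the length of the common run ending at a[i - 2], b[j - 1];
    # cur[j] extends that run when a[i - 1] == b[j - 1].
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def substring_similarity(a, b):
    # Hypothetical score: longest shared run relative to the shorter name.
    if not a or not b:
        return 0.0
    return len(longest_common_substring(a, b)) / min(len(a), len(b))


name_a = "Mikhail Sergeyevich Gorbachev"
name_b = "Michail Gorbatchev"
print(repr(longest_common_substring(name_a, name_b)))
print(substring_similarity(name_a, name_b))
```

On its own a single longest run is a weak signal for names with several spelling differences, so in practice one would probably combine it with other evidence, such as per-token comparison or edit distance.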



LCS is also widely used as a means to identify spans of text that are duplicates or near duplicates; similar methods can also be applied to the problems of plagiarism detection and authorship attribution.

- Paul

On Tue, 19 Feb 2013, Albretch Mueller wrote:

> LCS algorithms are heavily used in bioinformatics to analyze DNA sequences
> How are they used in corpora research?
> thanks,
> lbrtchx