[Corpora-List] Q: How to identify duplicates in a large document collection

Tom Emerson tree at basistech.com
Wed Dec 22 19:34:02 CET 2004


Rolf,

The work of Broder et al. published at WWW6 is a common root for many
duplicate document detection algorithms:

Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey
Zweig. 1997. "Syntactic Clustering of the Web". In Proceedings of the
6th World Wide Web Conference (WWW6).
http://decweb.ethz.ch/WWW6/Technical/Paper205/Paper205.html

There has been quite a bit of work following on from the shingle
fingerprinting proposed in that original paper: there are 113
citations listed in CiteSeer.
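
For anyone who has not read the paper, here is a rough Python sketch of
the shingling idea as I understand it. It is not the authors' code. The
function names, the shingle width w=4, and the sketch size s=100 are my
own choices for illustration, and I use a truncated SHA-1 digest where
the paper uses Rabin fingerprints. A document is reduced to its set of
w-word shingles, and the resemblance of two documents is the size of the
intersection of their shingle sets divided by the size of the union.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word shingles in text."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Document shorter than one shingle: treat it as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def fingerprint(shingle):
    """64-bit fingerprint of a shingle.

    Assumption: a truncated SHA-1 digest stands in for the Rabin
    fingerprints used in the paper.
    """
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, w=4, s=100):
    """Fixed-size sketch: the s smallest shingle fingerprints."""
    prints = {fingerprint(sh) for sh in shingles(text, w)}
    return set(sorted(prints)[:s])


def estimate(sketch_a, sketch_b, s=100):
    """Estimate resemblance from two sketches built with the same s."""
    smallest = set(sorted(sketch_a | sketch_b)[:s])
    if not smallest:
        return 1.0
    return len(smallest & sketch_a & sketch_b) / len(smallest)
```

Comparing sketches rather than full shingle sets is what makes this
practical on a large collection: each document is fingerprinted once,
and estimate(sketch(a), sketch(b)) approximates resemblance(a, b)
without going back to the text.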

We have been experimenting with various techniques for identifying
similar content on large, multilingual document collections harvested
from the Web, but are not ready to present any results.

-tree

--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"




