[Corpora-List] Q: How to identify duplicates in a large document collection

Gregor Erbach gor at acm.org
Wed Dec 22 23:39:01 CET 2004


I know of two publications on the efficient detection of duplicates
and near-duplicates in large document collections:

Andrei Z. Broder et al.
Syntactic Clustering of the Web
http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/

US Patent 6658423
William Pugh and Monika H. Henzinger
Google Inc.
Detecting duplicate and near-duplicate files
http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US6658423&F=0
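
For orientation, here is a minimal sketch of the core idea in the Broder
et al. paper: represent each document as a set of contiguous w-word
sequences ("shingles") and measure resemblance as the overlap of those
sets (Jaccard coefficient). A fixed-size min-hash sketch lets you
estimate the same value without comparing full shingle sets. This is a
simplification, not the paper's exact procedure: the function names, the
shingle width w=4, the 64 hash functions and the use of BLAKE2b are all
assumptions made for illustration (the paper uses Rabin fingerprints and
a different sampling scheme).

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact Jaccard coefficient of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def minhash_sketch(text, w=4, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    sh = shingles(text, w)
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # hypothetical way to get a hash family
        sketch.append(min(
            (hashlib.blake2b(" ".join(s).encode("utf-8"),
                             digest_size=8, salt=salt).digest()
             for s in sh),
            default=b"",
        ))
    return sketch


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates the Jaccard value."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)
```

With a million documents you would compute one sketch per document after
conversion to plain text, then group documents that share sketch values
instead of comparing all pairs; the cut-off for "near-duplicate" is a
tuning decision.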

regards,

Gregor

Ralf Steinberger wrote:


> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?
>
> The documents are in various formats (html, PDF, MS-Word, plain text,
> ...) so that we intend to first convert them to plain text. It is
> possible that the same text is present in the document collection in
> different formats.
>
> For smaller collections, we identify (near)-duplicates by applying
> hierarchical clustering techniques, but with this approach, we are
> limited to a few thousand documents.
>
> Any pointers are welcome. Thank you.
>
> Ralf Steinberger
> European Commission - Joint Research Centre
> http://www.jrc.it/langtech
>


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Gregor Erbach http://purl.org/net/gregor/
DFKI GmbH, Language Technology Lab http://www.dfki.de/
Tel. +49 (681) 302-5354 mailto:erbach at dfki.de







