[Corpora-List] Q: How to identify duplicates in a large document collection

Marian Olteanu mou_softwin at yahoo.com
Thu Dec 23 07:16:01 CET 2004


Sorry, I don't have time to read the recommended papers, but if I were in your shoes and were
looking for perfect matches (detecting identical documents, not similar ones), I would compute
an MD5 hash for each document and then look for duplicate hash values. Whenever two hashes match, I
would compare the two documents directly. This algorithm is practically O(n) + O(m*m)
(m = number of duplicate documents in the collection of n documents), because the probability of
getting the same MD5 value for two different documents is very, very low (with extremely high
probability, you will encounter no more than one false positive in the MD5 comparison).
Because you have different document types, I would convert them all to a common format before
computing the MD5 value (i.e., extract the text, keep only letters and digits (ignoring punctuation
and spaces), and uppercase everything).
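
To make the idea above concrete, here is a minimal Python sketch. It assumes the documents
have already been converted to plain text and fit in memory as a dict; the function names
(normalize, find_exact_duplicates) are only illustrative, not from any library. Instead of
comparing candidates pairwise, it groups them by their normalized text, which confirms the
same thing.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    # Keep only letters and digits (drop punctuation and spaces), then uppercase.
    return "".join(ch for ch in text if ch.isalnum()).upper()


def find_exact_duplicates(docs):
    """docs: dict mapping a document id to its extracted plain text (assumed input shape)."""
    # Pass 1: bucket documents by the MD5 digest of their normalized text.
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        norm = normalize(text)
        digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
        buckets[digest].append((doc_id, norm))

    # Pass 2: only buckets with more than one document are candidates.
    groups = []
    for candidates in buckets.values():
        if len(candidates) < 2:
            continue
        # Confirm with a direct comparison of the normalized text,
        # which rules out the (very unlikely) MD5 collision.
        confirmed = defaultdict(list)
        for doc_id, norm in candidates:
            confirmed[norm].append(doc_id)
        groups.extend(sorted(ids) for ids in confirmed.values() if len(ids) > 1)
    return groups


if __name__ == "__main__":
    sample = {
        "a": "Hello, World 1!",
        "b": "hello world1",
        "c": "Something else",
    }
    print(find_exact_duplicates(sample))
```

For a collection of a million documents you would probably keep only the digest and the file
path in memory and re-read the files for the confirmation step, rather than holding every
normalized text as this sketch does.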

--- Ralf Steinberger <ralf.steinberger at jrc.it> wrote:


> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?
>
> The documents are in various formats (html, PDF, MS-Word, plain text, ...)
> so that we intend to first convert them to plain text. It is possible that
> the same text is present in the document collection in different formats.
>
> For smaller collections, we identify (near)-duplicates by applying
> hierarchical clustering techniques, but with this approach, we are limited
> to a few thousand documents.
>
> Any pointers are welcome. Thank you.
>
> Ralf Steinberger
> European Commission - Joint Research Centre
> http://www.jrc.it/langtech



=====
Marian
http://www.utdallas.edu/~mgo031000/







