[Corpora-List] Q: How to identify duplicates in a large document collection

Shlomo Argamon argamon at iit.edu
Wed Dec 22 19:46:00 CET 2004


The people in the IIT IR lab have a recent paper on the topic:
http://ir.iit.edu/publications/downloads/p171-chowdhury.pdf

You might contact the authors directly to see if any software is available.

-Shlomo-

Mike Maxwell wrote:

>> At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>>
>>> We are facing the task of having to find duplicate and near-duplicate
>>> documents in a collection of about 1 million texts. Can anyone give
>>> us advice on how to approach this challenge?

>
> We thought about this a while back, when it turned out we had paid for
> translation of several pairs of articles where the two members of a
> pair had different filenames. We didn't implement a solution, but here
> are some thoughts:

>
> Do pairs of similar papers contain basically the same number of words?
> I would imagine they do, or you wouldn't be calling them "similar".
>
> I would then use file size as a heuristic, and only compare each
> article with a few of its neighbors in size. That might reduce the
> complexity from N*N to kN, where 'k' is some (hopefully small)
> constant (and assuming that sorting the documents by size is not
> time-consuming, which it certainly shouldn't be).
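
For what it's worth, here is a rough sketch of that size-window idea in
Python. It is only illustrative: the window size k, the 0.9 threshold,
and difflib's SequenceMatcher as the similarity measure are choices I
made for the example, not anything from the paper cited above.

    import difflib

    def find_near_duplicates(docs, k=5, threshold=0.9):
        """docs: a list of (name, text) pairs.  Each document is
        compared only with its k nearest neighbors in the length
        ordering, so the work is roughly k*N comparisons, not N*N."""
        # Sorting by length is O(N log N), cheap next to the
        # pairwise comparisons it avoids.
        by_size = sorted(docs, key=lambda d: len(d[1]))
        pairs = []
        for i, (name_a, text_a) in enumerate(by_size):
            # Only look at the next k documents in the size ordering.
            for name_b, text_b in by_size[i + 1:i + 1 + k]:
                ratio = difflib.SequenceMatcher(None, text_a,
                                                text_b).ratio()
                if ratio >= threshold:
                    pairs.append((name_a, name_b, ratio))
        return pairs

SequenceMatcher gets slow on long texts, so for a million documents you
would probably substitute a cheaper per-pair measure (word n-gram
overlap, say), but the windowing logic stays the same.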

>
> If there is variation in the way paragraphs are indicated (e.g.
> whether a blank line is inserted) and in inter-sentential spacing (one
> space character vs. two, maybe), then after converting the documents
> to plain text, you might find it necessary to go an additional step
> and convert them into some kind of canonical format, such as a
> tokenized form. There are other obvious normalizations you might want
> to apply, too.
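
Along the lines of that canonicalization step, a minimal sketch in
Python (which normalizations to apply is up to you; the case-folding
and punctuation stripping here are just illustrative choices):

    import re

    def canonicalize(text):
        """Collapse formatting variation (blank lines between
        paragraphs, one vs. two spaces after sentences) into a single
        canonical tokenized form."""
        text = text.lower()                   # case-fold
        text = re.sub(r'[^\w\s]', ' ', text)  # strip punctuation
        return ' '.join(text.split())         # collapse all whitespace

Running every document through something like this before the size
sort and the pairwise comparison keeps pure formatting differences from
masking duplicates.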
