[Corpora-List] Q: How to identify duplicates in a large document collection

Shlomo Argamon argamon at iit.edu
Wed Dec 22 19:46:00 CET 2004

The people in the IIT IR lab have a recent paper on the topic:

You might contact the authors directly to see if any software is available.


Mike Maxwell wrote:

>> At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>>
>>> We are facing the task of having to find duplicate and near-duplicate
>>> documents in a collection of about 1 million texts. Can anyone give
>>> us advice on how to approach this challenge?



> We thought about this a while back, when it turned out we had paid for
> translation of several pairs of articles where the members of the pair
> each had different filenames. We didn't implement a solution, but here
> are some thoughts:


> Do pairs of similar papers contain basically the same number of words? I
> would imagine they do, or you wouldn't be calling them "similar".


> I would then use file size as a heuristic, and only compare each article
> with a few of its neighbors in size. That might reduce the complexity
> from N*N to kN, where 'k' is some (hopefully small) constant (and
> assuming that sorting them by size is not time-consuming, which it
> certainly shouldn't be).
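
A minimal sketch of the size-neighbor heuristic quoted above, in Python. It is
not code from anyone in this thread. The function name candidate_pairs, the
sizes mapping (file name to size in bytes or words), and the default k=3 are
all hypothetical choices made for illustration.

```python
from typing import Iterator


def candidate_pairs(sizes: dict[str, int], k: int = 3) -> Iterator[tuple[str, str]]:
    """Yield each document paired with its next k neighbors in size order.

    `sizes` maps a document name to its size (bytes or word count).
    `k` is the assumed window width; it is a tuning knob, not a known value.
    """
    # Sort once by size (name breaks ties so the order is deterministic).
    ordered = sorted(sizes, key=lambda name: (sizes[name], name))
    for i, name in enumerate(ordered):
        # Only the k documents immediately larger than this one are compared,
        # so at most k*N pairs are produced instead of roughly N*N/2.
        for other in ordered[i + 1 : i + 1 + k]:
            yield name, other
```

The sort costs O(N log N), so for a million documents the pair generation is
cheap; the expensive part is whatever comparison is run on each pair. A fixed
window can miss duplicates when many unrelated documents share almost the same
size, so k probably has to grow with the density of the size distribution.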


> If there is variation in the way paragraphs are indicated (e.g. whether
> there is a blank line inserted) and inter-sentential spacing (one space
> character vs. two, maybe), then after converting them to plain text, you
> might find it necessary to go one stage further and convert them into
> some kind of canonical format, such as tokenized text. There are other
> obvious normalizations you might want to apply, too.
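
The canonical-format step quoted above could look roughly like the following
Python sketch. The function names, the \w+ tokenization, and the lower-casing
are assumptions made for illustration; the thread does not say which
normalizations are appropriate for this collection.

```python
import hashlib
import re
from collections import defaultdict


def canonical(text: str) -> str:
    """Reduce text to lower-cased word tokens joined by single spaces.

    This discards blank lines, double spaces and punctuation. Lower-casing
    is an assumed extra normalization, not one named in the thread.
    """
    return " ".join(re.findall(r"\w+", text.lower()))


def group_by_fingerprint(docs: dict[str, str]) -> list[list[str]]:
    """Group document names whose canonical forms are identical."""
    groups = defaultdict(list)
    for name, text in docs.items():
        digest = hashlib.sha1(canonical(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    # Keep only fingerprints shared by more than one document.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

For example, "Hello  world.\n\nSecond para." and "Hello world. Second para."
both reduce to "hello world second para" and land in the same group. Hashing
only finds documents that become identical after normalization; true
near-duplicates still need a similarity comparison on the canonical text.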

