[Corpora-List] Q: How to identify duplicates in a large document collection
argamon at iit.edu
Wed Dec 22 19:46:00 CET 2004
The people in the IIT IR lab have a recent paper on the topic:
You might contact the authors directly to see if any software is available.
Mike Maxwell wrote:
>> At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>>> We are facing the task of having to find duplicate and near-duplicate
>>> documents in a collection of about 1 million texts. Can anyone give
>>> us advice on how to approach this challenge?
> We thought about this a while back, when it turned out we had paid for
> translation of several pairs of articles where the members of the pair
> each had different filenames. We didn't implement a solution, but here
> are some thoughts:
> Do pairs of similar papers contain basically the same number of words? I
> would imagine they do, or you wouldn't be calling them "similar".
> I would then use file size as a heuristic, and only compare each article
> with a few of its neighbors in size. That might reduce the complexity
> from N*N to kN, where 'k' is some (hopefully small) constant (and
> assuming that sorting them by size is not time-consuming, which it
> certainly shouldn't be).
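A minimal Python sketch of the size-neighbor heuristic described above, for illustration only. The names `jaccard` and `candidate_duplicates`, the window `k`, the `threshold`, and the choice of word-set overlap as the comparison are all assumptions of this sketch; the message does not say which similarity measure to use.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts (an assumed stand-in comparison)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def candidate_duplicates(
    docs: Dict[str, str], k: int = 3, threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Compare each document only with its next k neighbors in size order."""
    # Sort names by text length; file size on disk would serve the same purpose.
    by_size = sorted(docs, key=lambda name: len(docs[name]))
    hits = []
    for i, name in enumerate(by_size):
        # At most k comparisons per document, so roughly k*N overall instead of N*N.
        for other in by_size[i + 1 : i + 1 + k]:
            score = jaccard(docs[name], docs[other])
            if score >= threshold:
                hits.append((name, other, score))
    return hits
```

Calling `candidate_duplicates(docs, k=5)` on a dict mapping filenames to text returns the pairs that pass the threshold. Near-duplicates whose sizes differ by more than the window covers will be missed, which is the price of the heuristic.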
> If there is variation in the way paragraphs are indicated (e.g. whether
> there is a blank line inserted) and inter-sentential spacing (one space
> character vs. two, maybe), then after converting them to plain text, you
> might find it necessary to go an additional stage and convert them into
> some kind of canonical format, such as tokenized. There are other
> obvious normalizations you might want to apply, too.
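One way the canonical-format step might look in Python, as a sketch rather than a recommendation. The function names `canonicalize` and `fingerprint` are hypothetical, and lowercasing and dropping punctuation are assumed examples of "other normalizations", not something the message specifies.

```python
import hashlib
import re


def canonicalize(text: str) -> str:
    """Reduce text to lowercase word tokens joined by single spaces."""
    # Tokenizing on word characters discards blank lines, paragraph markers,
    # and one-space vs. two-space differences between sentences.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(tokens)


def fingerprint(text: str) -> str:
    """Hash the canonical form so identical documents share one key."""
    return hashlib.sha1(canonicalize(text).encode("utf-8")).hexdigest()
```

Documents that differ only in layout get the same fingerprint, so grouping by fingerprint finds those exact duplicates in a single pass; near-duplicates still need a pairwise comparison on the canonical text.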