[Corpora-List] Q: How to identify duplicates in a large document collection

Alexander Clark asc at aclark.demon.co.uk
Wed Dec 22 19:09:00 CET 2004


One way that I have used on a collection of about 50K documents is to
construct a suffix array + longest common prefix table. This will allow
you to identify every "exactly" repeated string in the corpus.
This can be done in (where n is total length of all documents) O(n log
n) easily or, or O(n) with a bit of work.
This assumes that the duplicated portions of the documents are exact.
If all you are doing is looking for exact duplicates then just compute a
good hash function for each file and sort them.

This is related to the problem of finding all maximal repeats in a
string of length n which can be done in linear time.

"Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology" by
Gusfield (CUP) 1997 is the standard textbook for this stuff.





On Wed, 2004-12-22 at 16:45, Ralf Steinberger wrote:

> We are facing the task of having to find duplicate and near-duplicate

> documents in a collection of about 1 million texts. Can anyone give us

> advice on how to approach this challenge?

>

> The documents are in various formats (html, PDF, MS-Word, plain text,

> ...) so that we intend to first convert them to plain text. It is

> possible that the same text is present in the document collection in

> different formats.

>

> For smaller collections, we identify (near)-duplicates by applying

> hierarchical clustering techniques, but with this approach, we are

> limited to a few thousand documents.

>

> Any pointers are welcome. Thank you.

>

> Ralf Steinberger

> European Commission - Joint Research Centre

> http://www.jrc.it/langtech

>

--
Alexander Clark asc at aclark.demon.co.uk alexc at cs.rhul.ac.uk
http://www.cs.rhul.ac.uk/home/alexc/
Lecturer, Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey TW20 0EX
Direct 01784 443430 Department 01784 434455 Fax 01784 439786







More information about the Corpora-archive mailing list