[Corpora-List] Q: How to identify duplicates in a large document collection

Adam Kilgarriff adam at lexmasterclass.com
Thu Dec 23 07:50:00 CET 2004


We recently encountered this problem with the LDC’s English Gigaword corpus:
many of the stories in this newswire corpus occur repeatedly, with changing
datelines, often in updated and revised forms. We have also hit the same
question when producing corpora for dictionary-making from the web.



A crucial question in these situations is: what are the objects which might
be considered duplicates? If two stories share two paragraphs, but each
has two further paragraphs that are not shared, it is not obvious what
should be done. Our solution (working with Infogistics Ltd, from Edinburgh)
heuristically identified ‘paragraphs’ and treated them as the objects which
might be duplicates. It also looked at successions of paragraphs because,
firstly, identical short paragraphs may have been produced independently on
two or more occasions, and secondly, stripping out paragraphs destroys the
integrity of the text, so we did not want to do it lightly.
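
As a rough illustration of the paragraph-level approach described above: the sketch below is not the Infogistics implementation, whose details are not given here. It assumes paragraphs are separated by blank lines, that two paragraphs count as the same if they match after whitespace and case normalisation, and that only runs of at least `min_run` consecutive repeated paragraphs are flagged. All function names and the `min_run` parameter are hypothetical.

```python
import hashlib
import re


def split_paragraphs(text):
    """Heuristic (an assumption): paragraphs are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def fingerprint(paragraph):
    """Hash a paragraph after lowercasing and collapsing whitespace."""
    normalised = " ".join(paragraph.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def duplicate_runs(text, seen, min_run=2):
    """Return (start, end) paragraph index ranges that repeat earlier material.

    `seen` is a set of fingerprints from documents processed earlier; it is
    updated with this document's paragraphs before returning. Runs shorter
    than `min_run` are ignored, so a lone short paragraph that happens to
    recur is left in place. `end` is exclusive.
    """
    prints = [fingerprint(p) for p in split_paragraphs(text)]
    runs, start = [], None
    # The trailing None is a sentinel that closes a run ending at the last paragraph.
    for i, fp in enumerate(prints + [None]):
        if fp is not None and fp in seen:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    seen.update(prints)
    return runs
```

Flagging runs rather than single paragraphs reflects both concerns above: a short paragraph on its own may recur by chance, and removing isolated paragraphs damages the text more than removing a contiguous block.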



Of the papers mentioned in earlier responses to the query, I think the set
that uses whole-document similarity won’t help in our scenario, but the set
that looks for longest common substrings (see Alexander Clark’s mail) will.
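
To make the longest-common-substring idea concrete, here is a minimal sketch for a single pair of texts. It is a generic illustration using Python's standard `difflib`, not the method from the papers referred to above, and it works on whitespace-separated tokens rather than characters (an assumption). Comparing every pair this way would not scale to a large collection; it only shows what is being looked for.

```python
from difflib import SequenceMatcher


def longest_common_run(text_a, text_b):
    """Return the longest run of consecutive tokens shared by both texts."""
    a, b = text_a.split(), text_b.split()
    # autojunk=False stops difflib from discarding very frequent tokens.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]
```

The length of the returned run can then be compared against whatever threshold suits the purpose of the corpus.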



The interesting theoretical question lurking around here is: when does a
common expression (essential subject matter for corpus linguistics) turn
into duplication (which is not wanted)? Repetition of the former kind is
the fabric of language. If I speak in formulae and clichés, as so many of
us do so much of the time, it is likely that my speaker turns will exactly
match others’. Quotations are another intermediate case – if someone quotes
half a sentence from a text that is also in the corpus, you want to leave it
in. If it is a couple of sentences – maybe. If it is a couple of
paragraphs or more you may well want to throw it out as duplication. My
suspicion is that it will always depend on what you want to do with the
corpus.



Adam Kilgarriff

Lexical Computing Ltd





-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Ralf Steinberger
Sent: 22 December 2004 16:46
To: List Corpora (Corpora list)
Subject: [Corpora-List] Q: How to identify duplicates in a large document
collection



We are facing the task of having to find duplicate and near-duplicate
documents in a collection of about 1 million texts. Can anyone give us
advice on how to approach this challenge?



The documents are in various formats (HTML, PDF, MS Word, plain text, ...),
so we intend to convert them to plain text first. It is possible that
the same text is present in the document collection in different formats.
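
One possible first pass for the case of the same text arriving in different formats, sketched under the assumption that conversion to plain text has already been done: normalise each text and group documents by a hash of the result. This only catches copies that are identical after normalisation, not near-duplicates, and the function names are hypothetical.

```python
import hashlib
import unicodedata
from collections import defaultdict


def content_key(plain_text):
    """Hash of the text after Unicode, case and whitespace normalisation."""
    text = unicodedata.normalize("NFKC", plain_text).lower()
    text = " ".join(text.split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def group_exact_duplicates(docs):
    """Group document ids whose normalised text is identical.

    `docs` is an iterable of (doc_id, plain_text) pairs. Only groups with
    more than one member are returned.
    """
    groups = defaultdict(list)
    for doc_id, text in docs:
        groups[content_key(text)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Because each document is hashed once, this pass is linear in the size of the collection and can be run before any more expensive near-duplicate comparison.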



For smaller collections, we identify (near)-duplicates by applying
hierarchical clustering techniques, but with this approach, we are limited
to a few thousand documents.



Any pointers are welcome. Thank you.



Ralf Steinberger

European Commission - Joint Research Centre

http://www.jrc.it/langtech


