[Corpora-List] Gold standard for document similarity

Paul D Clough p.d.clough at sheffield.ac.uk
Wed Mar 5 15:23:52 CET 2014


Hi, for research purposes there is the METER Corpus: http://nlp.shef.ac.uk/meter/. Let me know if you want a copy. I helped create the corpus to assess methods for detecting text reuse.

Paul.

On 5 March 2014 10:13, Tony Russell-Rose <tgr at russellrose.com> wrote:


> A few years ago Adam Kilgarriff & I wrote a paper evaluating various
> metrics for comparing corpora, and as part of that process created a set of
> 'known similarity corpora' which included various newspaper sources. It's
> documented here:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
>
> Not sure we still have the data but it shouldn't be too difficult to
> recreate (feel free to contact me offline)
>
> HTH,
> Tony
> --
> -------------------------------
> Tony Russell-Rose PhD FBCS CITP
> Vice-chair, BCS IRSG
> Chair, IEHF HCI Group
> http://uxlabs.co.uk
> http://isquared.wordpress.com
>
> On 04/03/2014 15:48, Ivelina Nikolova wrote:
>
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>

--
Dr. Paul Clough
Reader in Information Retrieval
Information School
University of Sheffield
Regent Court
Sheffield S1 4DP
Tel: +44 (0)114 2222664
Fax: +44 (0)114 2780300
Email: p.d.clough at sheffield.ac.uk
Web: http://ir.shef.ac.uk/cloughie/


