[Corpora-List] Gold standard for document similarity

Tony Russell-Rose tgr at russellrose.com
Wed Mar 5 11:13:55 CET 2014


A few years ago Adam Kilgarriff & I wrote a paper evaluating various metrics for comparing corpora, and as part of that process created a set of 'known similarity corpora' which included various newspaper sources. It's documented here:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716

I'm not sure we still have the data, but it shouldn't be too difficult to recreate (feel free to contact me offline).
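For anyone wanting to recreate something along these lines, here is a rough Python sketch of the general idea behind known-similarity corpora: mix documents from two distinct sources in stepped proportions, so that the relative similarity of any two mixes is known by construction. This is an illustration only, not the procedure from the paper. The function name `build_ksc` and the parameters `steps`, `size` and `seed` are made up for the example, and each mix is sampled independently, so the same document can turn up in more than one corpus (a stricter version would draw disjoint slices).

```python
import random


def build_ksc(source_a, source_b, steps=10, size=100, seed=0):
    """Return a list of (share_of_a, documents) pairs.

    source_a, source_b: lists of documents from two distinct sources
    (hypothetical inputs, e.g. articles from two newspapers).
    steps: number of mixing intervals; steps + 1 corpora are produced.
    size: number of documents in each corpus.
    """
    if len(source_a) < size or len(source_b) < size:
        raise ValueError("each source needs at least `size` documents")
    rng = random.Random(seed)
    corpora = []
    for i in range(steps + 1):
        # Number of documents taken from source A at this step.
        n_a = size * (steps - i) // steps
        docs = rng.sample(source_a, n_a) + rng.sample(source_b, size - n_a)
        corpora.append((n_a / size, docs))
    return corpora
```

A similarity metric can then be checked against the ordering: a pair of corpora whose mixing proportions are closer together should score as more similar than a pair whose proportions are further apart.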

HTH,
Tony

--
Tony Russell-Rose PhD FBCS CITP
Vice-chair, BCS IRSG
Chair, IEHF HCI Group
http://uxlabs.co.uk
http://isquared.wordpress.com

On 04/03/2014 15:48, Ivelina Nikolova wrote:
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>
