[Corpora-List] Summary - Thanks for the replies! -> Gold standard for document similarity

Ivelina Nikolova iva at lml.bas.bg
Fri Mar 7 14:52:56 CET 2014


Thanks to everyone who replied to my post! I've compiled a summary of the answers below.

General comment: Comparatively few similarity datasets above the sentence level exist.

Resources:

1. Lee & Pincombe's dataset: Michael D. Lee, Brandon Pincombe, and Matthew Welsh. 2005. An empirical evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1254--1259, Mahwah, NJ. Erlbaum.

These are human-graded similarity judgments between paragraph-sized texts. To get access to the dataset, contact Michael D. Lee <mdlee at uci.edu>.
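One common way to use human-graded similarities like these is to correlate a metric's scores with the human ratings. The following is only a minimal sketch of that idea: the function name `pearson` and the numbers are illustrative assumptions and are not taken from the Lee, Pincombe and Welsh data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative numbers, NOT taken from any real dataset.
# One entry per document pair: the human rating and the metric's score.
human_ratings = [1.0, 2.5, 4.0, 4.5, 3.0]
metric_scores = [0.10, 0.35, 0.70, 0.90, 0.40]

r = pearson(human_ratings, metric_scores)
print(f"Pearson r = {r:.3f}")
```

A rank correlation (e.g. Spearman) could be swapped in if only the ordering of the pairs matters.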

2. Linda Bawcom's observations: 1) much of the similarity is caused by so many newspapers using the same agencies (mostly Reuters and Associated Press in the United States) to get their news, and 2) she used a free online similarity program (one that is normally used for plagiarism detection) to find that similarity: http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/. She prepared a corpus on TSUNAMI-related topics.

Contact: Linda Bawcom <linda.bawcom at sbcglobal.net>

3. SemEval Text Similarity task 2013

http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54

- Core task: given two sentences, s1 and s2, participants return a score quantifying how similar s1 and s2 are.
- Pilot task on typed similarity between semi-structured records. The types of similarity to be studied include location, author, people involved, time, events or actions, subject, description.

Data is available here: http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56

Contact: "Zesch, Torsten, Dr." <torsten.zesch at uni-due.de>
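To make the core task in item 3 concrete, here is a minimal sketch of a scoring function that takes two sentences and returns a similarity score. It uses bag-of-words cosine similarity as a simple stand-in; it is not claimed to be the organisers' baseline or any participant's system, and the names `tokenize` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(s1, s2):
    """Cosine similarity between the token-count vectors of two sentences."""
    v1, v2 = Counter(tokenize(s1)), Counter(tokenize(s2))
    # Only tokens present in both sentences contribute to the dot product.
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm1 * norm2)

# Illustrative sentence pair, not taken from the task data.
score = cosine_similarity("A man is playing a guitar.",
                          "A man plays the guitar.")
print(f"similarity = {score:.3f}")
```

The score lies between 0 and 1; it would need rescaling before comparison with gold scores on a different scale.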

4. 20 newsgroups

http://qwone.com/~jason/20Newsgroups/

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

5. Reuters corpus

http://about.reuters.com/researchandstandards/corpus/statistics/index.asp

6. Adam Kilgarriff & Tony Russell-Rose wrote a paper evaluating various metrics for comparing corpora, and as part of that process created a set of 'known similarity corpora' which included various newspaper sources.

It's documented here: Measures for corpus similarity and homogeneity
http://aclweb.org/anthology//W/W98/W98-1506.pdf

The documents are here: ftp://ftp.itri.brighton.ac.uk/KSC

The METER Corpus is here: http://nlp.shef.ac.uk/meter/

Contacts: Tony Russell-Rose <tgr at russellrose.com>, Paul D Clough <p.d.clough at sheffield.ac.uk>
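Tony mentions that the 'known similarity corpora' in item 6 should not be too difficult to recreate. A minimal sketch of one way to do so, assuming they are built by mixing documents from two sources in graded proportions so that the similarity ordering is known by construction. The function `build_ksc` and its parameters are hypothetical, and the details in the paper may differ.

```python
import random

def build_ksc(docs_a, docs_b, size, steps, seed=0):
    """Build steps + 1 corpora of `size` documents each, ranging from
    all-A to all-B, with the share of B growing in equal increments."""
    if size > len(docs_a) or size > len(docs_b):
        raise ValueError("each source needs at least `size` documents")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    corpora = []
    for i in range(steps + 1):
        n_b = round(size * i / steps)   # documents drawn from source B
        n_a = size - n_b                # the rest come from source A
        mix = rng.sample(docs_a, n_a) + rng.sample(docs_b, n_b)
        rng.shuffle(mix)
        corpora.append(mix)
    return corpora

# Placeholder document IDs standing in for articles from two newspapers.
source_a = [f"a{i}" for i in range(10)]
source_b = [f"b{i}" for i in range(10)]

ksc = build_ksc(source_a, source_b, size=10, steps=5)
for corpus in ksc:
    share_b = sum(doc.startswith("b") for doc in corpus) / len(corpus)
    print(f"share of source B: {share_b:.1f}")
```

A corpus similarity measure can then be checked against this ordering: corpora with closer mixing proportions should come out as more similar.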

7. JRC resources

- JEX corpus, which accompanies the JEC software (http://ipsc.jrc.ec.europa.eu/index.php?id=60)
- The news clusters downloaded and annotated for multi-document summarisation (see the bottom of the page http://ipsc.jrc.ec.europa.eu/?id=61)
- NewsExplorer news clusters (e.g. http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html)

Contacts: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>

8. Recent publications on the topic

Daniel Baer's PhD Thesis: http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf

--Ivelina

--
Ivelina Nikolova
PhD student in Computer Science
Linguistic Modelling Department
Institute of Information and Communication Technologies
Bulgarian Academy of Sciences

On 03/05/2014 04:23 PM, Paul D Clough wrote:
> Hi, for research purposes there is the METER Corpus:
> http://nlp.shef.ac.uk/meter/. Let me know if you want a copy. I helped
> create the corpus to assess methods for detecting text reuse.
>
> Paul.
>
>
>
> On 5 March 2014 10:13, Tony Russell-Rose <tgr at russellrose.com
> <mailto:tgr at russellrose.com>> wrote:
>
> A few years ago Adam Kilgarriff & I wrote a paper evaluating
> various metrics for comparing corpora, and as part of that process
> created a set of 'known similarity corpora' which included various
> newspaper sources. It's documented here:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
>
> Not sure we still have the data but it shouldn't be too difficult
> to recreate (feel free to contact me offline)
>
> HTH,
> Tony
> --
> -------------------------------
> Tony Russell-Rose PhD FBCS CITP
> Vice-chair, BCS IRSG
> Chair, IEHF HCI Group
> http://uxlabs.co.uk
> http://isquared.wordpress.com
>
> On 04/03/2014 15:48, Ivelina Nikolova wrote:
>> Dear corpora members,
>>
>> I am looking for a gold standard to train/evaluate document
>> similarity metrics.
>> Can anyone suggest a suitable corpus for such purposes? I'm
>> especially interested in similarity between newspaper articles.
>>
>> Thanks in advance,
>> Ivelina
>>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no <mailto:Corpora at uib.no>
> http://mailman.uib.no/listinfo/corpora
>
>
>
>
> --
> -------------------------------------------------------------------------
> Dr. Paul Clough
> Reader in Information Retrieval
>
> Information School
> University of Sheffield
> Regent Court
> Sheffield S1 4DP
> Tel: +44 (0)114 2222664
> Fax: +44 (0)114 2780300
> Email: p.d.clough at sheffield.ac.uk <mailto:p.d.clough at sheffield.ac.uk>
> Web: http://ir.shef.ac.uk/cloughie/
> -------------------------------------------------------------------------
