[Corpora-List] datasets for automatic keyword/keyphrase extraction task

Alexander Schutz goalscoringsuperstarhero at gmail.com
Wed May 5 10:37:27 CEST 2010


A dataset resulting from my master's thesis, 'Keyphrase Extraction from Single Documents in the Open Domain Exploiting Linguistic and Statistical Methods' [1], is available at [2].

It is based on the PubMed dataset available for download at [3], which already contains keyphrases for its documents. For each document, my dataset contains a reference back to the original PubMed article via its pmcid, the originally assigned keyphrases (gold standard), the keyphrases assigned by my approach together with their confidence, an indication of which sort of match occurred between gold standard and approach, and some document statistics. It covers 1323 documents from the original PubMed dataset (roughly 80k documents).
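As a rough illustration of how such per-document records could be used, here is a minimal Python sketch that scores extracted keyphrases against the gold standard. Everything in it is an assumption made for illustration: the record layout, the field names (pmcid, gold_keyphrases, extracted) and the confidence threshold are hypothetical and do not reflect the actual file format in the archive at [2]. Plain exact matching after lowercasing is used here for simplicity; it is not necessarily the matching scheme used in the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DocumentRecord:
    """Hypothetical per-document record; field names are illustrative only."""
    pmcid: str                                   # reference back to the PubMed article
    gold_keyphrases: List[str]                   # originally assigned keyphrases
    extracted: Dict[str, float] = field(default_factory=dict)  # phrase -> confidence


def normalise(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(phrase.lower().split())


def exact_match_scores(record: DocumentRecord,
                       min_confidence: float = 0.0) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact matching on normalised phrases.

    Extracted phrases below min_confidence are ignored.
    """
    gold = {normalise(p) for p in record.gold_keyphrases}
    predicted = {normalise(p) for p, conf in record.extracted.items()
                 if conf >= min_confidence}
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example record, not taken from the dataset.
record = DocumentRecord(
    pmcid="PMC0000000",
    gold_keyphrases=["Gene Expression", "protein folding", "apoptosis"],
    extracted={"gene expression": 0.9, "cell death": 0.4,
               "apoptosis": 0.7, "mitochondria": 0.2},
)
print(exact_match_scores(record))                      # all extracted phrases
print(exact_match_scores(record, min_confidence=0.5))  # confident phrases only
```

Raising the threshold trades recall for precision, which is one way the per-keyphrase confidence values could be put to use.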

For those who do not have time to read the full thesis, the procedure is summarised in [4] and subsequent pages. To properly understand how this dataset was produced, it is necessary to read and understand at least [5] or the evaluation chapter of the thesis.

Happy extracting. Alex

P.S. There is also a dataset of qualitative evaluation results. However, as it comprises keyphrases from user-specified content, I suspect it is not useful for anyone else.

P.P.S. If you have questions, don't hesitate to give me a shout.

[1] http://smile.deri.ie/sites/default/files/schutz-mappsc-2008-keyphrase-extraction_revised.pdf
[2] http://smile.deri.ie/sites/default/files/quantitative-evaluation-dataset.zip
[3] ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.tar.gz
[4] http://smile.deri.ie/projects/keyphrase-extraction
[5] http://smile.deri.ie/node/204

On Wed, May 5, 2010 at 2:46 AM, Su Nam Kim <sunamkim at gmail.com> wrote:
> Hello, all
> 4 datasets for automatic keyphrase extraction task are available at
> http://github.com/skrathnam/AutomaticKeyphraseExtraction
> If you have questions about datasets, please contact the data
> developers directly.
> Also, if you have a dataset to share, please, contact me to post.
> Thank you.
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
