[Corpora-List] english summarization dataset request

Min-Yen Kan knmnyn at gmail.com
Tue May 31 12:28:48 CEST 2016


Hi all:

You can also try the CL SciSumm dataset https://github.com/WING-NUS/scisumm-corpus , which is still being developed for use in the shared task at BIRNDL 2016 this year.

--- February 29, 2016

This package contains a release of training topics to aid in the development of computational linguistics summarization systems.

Please read further for details on the Computational Linguistics Shared Task, run as part of the BIRNDL 2016 workshop co-located with JCDL 2016. The official website is hosted at: http://wing.comp.nus.edu.sg/cl-scisumm2016/

To participate in the 2016 shared task, please register your team details at: https://easychair.org/conferences/?conf=birndl2016

To learn how this corpus was constructed, please see ./docs/corpusconstruction.txt

Overview

You are invited to participate in the CL-SciSumm Shared Task at BIRNDL 2016. The shared task will be on automatic paper summarization in the Computational Linguistics (CL) domain. The output summaries will be of two types: faceted summaries of the traditional self-summary (the abstract) and the community summary (the collection of citation sentences ‘citances’). We also propose to group the citances by the facets of the text that they refer to.

This task follows up on the successful CL Pilot Task conducted as a part of the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014). It follows the basic structure and guidelines of the Biomedical Summarization Track and adapts them for annotating and creating a corpus of training topics from computational linguistics research papers. The task is defined as follows:

Given: A topic consisting of a Reference Paper (RP) and up to 10 Citing Papers (CPs) that all contain citations to the RP. In each CP, the text spans (i.e., citances) that pertain to a particular citation to the RP have been identified.

Task 1a: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).

Task 1b: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.

Task 2 (optional bonus task): Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.
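As a rough illustration of Task 1a, here is a minimal Python sketch of how a cited text span might be represented and compared against a gold span at the sentence level. The representation of a span as a set of RP sentence IDs, the function names, and the precision/recall/F1 formulation are all assumptions made for this example; this is not the official scorer for the shared task.

```python
from typing import Set, Tuple

# From the task description: a span is at most 5 consecutive sentences.
MAX_SPAN_SENTENCES = 5


def is_valid_span(sentence_ids: Set[int]) -> bool:
    """Check that a span is non-empty, consecutive, and at most 5 sentences."""
    if not sentence_ids or len(sentence_ids) > MAX_SPAN_SENTENCES:
        return False
    # A set of distinct integers is consecutive iff its range equals its size.
    return max(sentence_ids) - min(sentence_ids) + 1 == len(sentence_ids)


def sentence_overlap(system: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sentence IDs shared by both spans.

    Hypothetical scoring for illustration only.
    """
    shared = len(system & gold)
    precision = shared / len(system) if system else 0.0
    recall = shared / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    system_span = {12, 13, 14}  # sentence IDs chosen by a system (made up)
    gold_span = {13, 14}        # sentence IDs in the annotation (made up)
    print(is_valid_span(system_span))
    print(sentence_overlap(system_span, gold_span))
```

Working with sets of sentence IDs keeps the comparison independent of how the span text itself is tokenized, which is one reason to prefer it in a sketch like this.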

Evaluation: Task 1 will be scored by the overlap of text spans, measured by the number of sentences in the system output versus the gold standard. Task 2 will be scored using the ROUGE family of metrics between (i) the system output and the gold standard summary from the reference spans and (ii) the system output and the abstract of the reference paper. Again, Task 2 is optional.

Cheers,

Min

-- Min-Yen KAN (Dr) :: Associate Professor :: National University of Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive Singapore 117417 :: +65 6516 1885(DID) :: +65 6779 4580 (Fax) :: kanmy at comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)

On Mon, May 30, 2016 at 5:55 PM, Ricardo Daniel Santos Faro Marques Ribeiro <ricardo.ribeiro at inesc.pt> wrote:
> Dear Shadi,
>
> For other datasets, see, for example,
>
> - https://ec.europa.eu/jrc/en/language-technologies [Marco Turchi, Josef
> Steinberger, Mijail Kabadjov & Ralf Steinberger (2010). Using Parallel
> Corpora for Multilingual (Multi-Document) Summarisation Evaluation.
> Multilingual and Multimodal Information Access Evaluation. Springer Lecture
> Notes in Computer Science, LNCS 6360/2010, pp. 52-63]
>
> - http://www.taln.upf.edu/pages/concisus/index.html
>
> - http://multiling.iit.demokritos.gr/
>
> Best regards,
>
> —Ricardo Ribeiro.
>
> On 29 May 2016, at 08:03, Shadi Hossein Nejad <shadi.hn at gmail.com> wrote:
>
> Hi everybody,
> I'm a student in the NLP field and, to evaluate my summarization system, I
> need an English summarization dataset. I couldn't get the DUC dataset from
> the NIST website because I'm an independent researcher, and the only version
> I could download on the web did not include the full-text files, just the
> summaries. I was wondering if any of you could please help me and send me a
> dataset to test my system, or a link where I can download the full version of
> the DUC or TAC dataset?
>
> I really don't know what to do, and I appreciate your help so much.
> Thanks a lot in advance for your attention and help.
> shadi
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>


