[Corpora-List] ACL Anthology Reference Corpus, Version 2, released

Min-Yen Kan knmnyn at gmail.com
Tue Mar 1 17:54:29 CET 2016


Dear Corpora List members:

(Apologies for the cross-posting)

The Association for Computational Linguistics (ACL) has a longstanding history of publishing its scholarly works under a permissive license that allows for open source sharing for most purposes. The archives of these works have been available in the ACL Anthology (http://www.aclweb.org/anthology) for anyone to read and re-use for a number of years.

We have now released a new version of the ACL Anthology Reference Corpus (v2), updated to include all ACL venues up to December 2015. It is made available alongside the original v1 corpus released in 2009.

http://acl-arc.comp.nus.edu.sg/

The new version includes all ACL Anthology files whose copyright belongs to the ACL (i.e., excluding COLING, LREC, and other third-party, sister CL/NLP associations' publications), totalling 22,878 articles.

After collating the shared task ideas that we solicited from this community in August 2015, we are organizing a shared task on scientific (computational linguistics) document summarization that is open for registration, with support from Microsoft Research Asia. We will be announcing this CL-SciSumm 2016 task on the list shortly. The shared task results will be part of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), co-located with the Joint Conference on Digital Libraries (JCDL '16), Newark, New Jersey, USA, on 23 June 2016. http://wing.comp.nus.edu.sg/cl-scisumm2016/

We hope this frozen corpus will be used for benchmarking applications for scholarly and bibliometric data processing, as the first ACL ARC release has been used by the community since its release in 2009. We look forward to the community using its own scholarly publications to advance the speed of discovery and dissemination in computational linguistics.

Due to the size of the corpus, please download the component files one at a time and do not attempt parallel downloads, as these use up all of our webserver's bandwidth. We will be pursuing alternative methods for disseminating the corpus. If you are interested in obtaining this corpus on a USB drive, please feel free to send an email to me <kanmy at comp.nus.edu.sg>; we may prepare and send this corpus to faculty or institutional representatives to ease download frustration.
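
For anyone scripting the download, the following is a minimal Python sketch of fetching files strictly one at a time, with a pause between requests. It is an illustration under stated assumptions, not an official tool: the announcement does not list the component file names, so the caller must supply them from the download page, and the function names, the pause length, and the assumption that files sit directly under the site URL are all hypothetical.

```python
import time
import urllib.request
from pathlib import Path

# Assumption: files are served directly under the site URL given in the
# announcement. Check the actual links on the download page.
BASE_URL = "http://acl-arc.comp.nus.edu.sg/"


def fetch_to_file(url, dest):
    """Download one URL to dest over a single connection."""
    urllib.request.urlretrieve(url, dest)


def download_sequentially(names, out_dir, base_url=BASE_URL,
                          pause_seconds=5.0, fetch=fetch_to_file):
    """Fetch each named file in turn, never more than one at a time.

    names: file names taken from the download page (not given here).
    Returns the list of paths that were newly downloaded.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in names:
        dest = out / name
        if dest.exists():
            continue  # already complete from an earlier run
        # Write to a temporary name first so an interrupted transfer
        # is not mistaken for a finished file on the next run.
        part = dest.with_name(dest.name + ".part")
        fetch(base_url + name, part)
        part.replace(dest)
        fetched.append(dest)
        time.sleep(pause_seconds)  # be gentle on the server
    return fetched
```

The loop is deliberately serial: there are no threads or concurrent requests, so at most one transfer is open at any moment. Re-running the function skips files that already finished, which avoids downloading them again.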

As with the ACL Anthology itself, the corpus is Creative Commons licensed and royalty-free, and can be freely shared and copied, subject to attribution of the original ACL Anthology source.

Cheers,

- Min-Yen Kan ACL Anthology Editor

On Wed, Aug 5, 2015 at 3:17 PM, Min-Yen Kan <knmnyn at gmail.com> wrote:
> Dear Corpora List members:
>
> (Apologies for the cross-posting)
>
> The Association for Computational Linguistics (ACL) has had a
> longstanding history of publishing its scholarly works under a
> permissive license that allows for open source sharing for most
> purposes. The archives of these works have been available in the ACL
> Anthology (http://www.aclweb.org/anthology) for any to read and
> re-use, for a number of years.
>
> To better serve our own community in corpus linguistics, we plan to
> release a machine readable version with the text and logical document
> formatting of the articles, for all of the scholarly publications in
> the ACL Anthology. This should be forthcoming within the next few
> months, and shall be announced here as well.
>
> At this stage, we would like to solicit ideas for shared tasks or
> workshop themes that would involve the scholarly materials in the ACL
> Anthology. Some suggestions have been to hold a task on document
> retrieval, document summarization, keyphrase extraction or sentiment
> analysis.
>
> A significant difficulty is in annotation of ground truth for any of
> these tasks. Without a funding source, we are planning to ask
> participants to do pooled annotation of system results, in the style
> of TREC.
>
> We hope with this post to seed the discussion about such a dataset
> and task, with the objective of building up a community-initiated
> workshop in the 2016/2017 timeframe.
>
> Thank you for your attention!
>
> - Min-Yen Kan
> ACL Anthology Editor


