We are happy to introduce a new corpus of scientific writing for use by the NLP/ML community: the Elsevier OA CC-BY corpus. The corpus is a large collection (40k documents) of scientific research papers, providing a representative sample from across scientific disciplines. The collection includes not only the full text of the articles in machine-readable form, but also the document metadata and the bibliographic information for each reference.
The dataset is published on Mendeley Data and can be found here: Kershaw, Daniel; Koeling, Rob (2020), “Elsevier OA CC-BY Corpus”, Mendeley Data, v1, http://dx.doi.org/10.17632/zm33cdndxs.1
A short paper describing the dataset can be found here: https://arxiv.org/abs/2008.00774
Please get in touch with either of the authors if you have any questions about this dataset.
Rob Koeling
Principal Data Scientist
Research Products, Elsevier