[Corpora-List] Release of a new web-scale dependency-parsed corpus based on CommonCrawl

Alexander Panchenko panchenkoalexander at gmail.com
Tue Nov 7 17:07:22 CET 2017


Hello,

A new web-scale dependency-parsed corpus based on CommonCrawl is now available at https://commoncrawl.s3.amazonaws.com/contrib/depcc/CC-MAIN-2016-07/index.html . DepCC is a large, linguistically analyzed English corpus of 365 million documents from a web-scale crawl, comprising 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences. You can access or download the corpus from Amazon S3. If you work on Amazon, you do not need to download it: from an Amazon EC2 instance in the us-east-1 region you can read it directly from the S3 file system for free. More details are available on the official page of the corpus: https://arxiv.org/abs/1710.01779
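For readers who want to look at the data without downloading a whole file first, here is a minimal Python sketch of streaming one corpus file over HTTPS. The bucket host and prefix are taken from the index URL above. Everything else is an assumption made for illustration: that the corpus files are gzip-compressed text, that they sit directly under that prefix, and the file name passed to `stream_remote`, which is a placeholder to be replaced with a real name from the index page.

```python
import gzip
import io
from typing import BinaryIO, Iterator, List
from urllib.parse import quote
from urllib.request import urlopen

# Host and prefix come from the index URL in the announcement.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"
PREFIX = "contrib/depcc/CC-MAIN-2016-07/"


def object_url(file_name: str) -> str:
    """Build the HTTPS URL of one file under the DepCC prefix."""
    return BASE_URL + PREFIX + quote(file_name)


def iter_lines(raw: BinaryIO) -> Iterator[str]:
    """Decompress a gzip byte stream on the fly and yield text lines.

    Assumption: the corpus files are gzip-compressed UTF-8 text.
    """
    with gzip.open(raw, mode="rt", encoding="utf-8") as text:
        for line in text:
            yield line.rstrip("\n")


def stream_remote(file_name: str, limit: int = 10) -> List[str]:
    """Return the first `limit` lines of a remote file (needs network access).

    `file_name` is hypothetical here; take real names from the index page.
    """
    lines: List[str] = []
    with urlopen(object_url(file_name)) as response:
        for i, line in enumerate(iter_lines(response)):
            if i >= limit:
                break
            lines.append(line)
    return lines


if __name__ == "__main__":
    # Offline check of the decompression step with made-up in-memory data.
    sample = io.BytesIO(gzip.compress("first line\nsecond line\n".encode("utf-8")))
    for line in iter_lines(sample):
        print(line)
```

Reading line by line keeps memory use constant regardless of file size, which matters at this scale. From an EC2 instance in us-east-1 the same loop could read from S3 directly, as described above.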

Best regards,
Dr. Alexander Panchenko

University of Hamburg
Faculty of Mathematics, Informatics and Natural Sciences
Department of Informatics, Language Technology Group
Vogt-Kölln-Str. 30, F-416, 22527 Hamburg
Tel: +49 40 428 832 368
https://www.inf.uni-hamburg.de/en/inst/ab/lt/people/alexander-panchenko.html


