[Corpora-List] Westbury Lab English Wikipedia corpus now available. (April 2010 version)

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Thu May 20 23:57:49 CEST 2010

Dear Fellow Corpora List members:

In a similar style to our USENET corpus, we have just released the first version of a corpus extracted from the English Wikipedia. It was created from a snapshot taken in April 2010. It is freely available now at the following URL:


There is a complete description of the corpus on the above web page, but here are a few quick points:

1) Size: over 900 million words in over 2.8 million documents.
2) Clean text, unprocessed and untagged.
3) Distributed as a single file (1.8 GB, compressed) with document delimiters.
4) CC license. Please read the licensing for this corpus and for Wikipedia carefully.
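
Since the corpus comes as one file with document delimiters, here is a rough sketch of how such a file might be read one document at a time in Python without loading it all into memory. The delimiter string, the function names, and the assumption that the delimiter sits on a line of its own are illustrative only and are not stated in this announcement; please check the corpus web page for the actual marker.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed delimiter, for illustration only; substitute the real marker.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(lines: Iterable[str], delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield one document at a time from a stream of text lines."""
    buffer: List[str] = []
    for line in lines:
        if line.strip() == delimiter:
            text = "".join(buffer).strip()
            if text:  # skip empty documents
                yield text
            buffer = []
        else:
            buffer.append(line)
    # Emit any trailing text that is not followed by a delimiter.
    text = "".join(buffer).strip()
    if text:
        yield text


def count_words(lines: Iterable[str], delimiter: str = DELIMITER) -> Tuple[int, int]:
    """Return (number of documents, number of whitespace-separated tokens)."""
    docs = 0
    words = 0
    for doc in iter_documents(lines, delimiter):
        docs += 1
        words += len(doc.split())
    return docs, words
```

To try it on the uncompressed file, something like `with open(path, encoding="utf-8") as f: print(count_words(f))` should work, where `path` is wherever you saved the file. The UTF-8 encoding is also an assumption.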

As always, it is available as a direct download to those on Internet2. For normal Internet connections, we offer a BitTorrent download. If you use the BitTorrent download, please help us synchronize the swarm by starting your download today and leaving your BitTorrent program running for a few days after the download completes. This will help others download the file, and help you create some good karma for yourself.

Your feedback is welcome and appreciated,


--
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
