[Corpora-List] Westbury Lab English Wikipedia corpus now available. (April 2010 version)

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Thu May 20 23:57:49 CEST 2010


Dear Fellow Corpora List members:

Following the style of our USENET corpus, we have just released the first version of a corpus extracted from the English Wikipedia. It was created from a snapshot taken in April 2010 and is freely available now at the following URL:

http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html

There is a complete description of the corpus on the above web page, but here are a few quick points:

1) Size: over 900 million words in over 2.8 million documents.
2) Clean text, unprocessed and untagged.
3) Distributed as a single file (1.8Gb, compressed) with document delimiters.
4) CC license. Please read the licensing for this corpus and for Wikipedia carefully.
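
Because the corpus ships as one large file with document delimiters, it is usually easier to stream it document by document than to load it whole. The following is a minimal sketch of that idea, not an official reader. The delimiter string `---END.OF.DOCUMENT---` and the helper name `iter_documents` are assumptions made for illustration; check the download page for the actual marker before relying on it.

```python
import io
from typing import Iterator, TextIO

# ASSUMPTION: this announcement does not state the delimiter; confirm the
# real marker on the corpus download page and change this constant if needed.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(stream: TextIO, delimiter: str = DELIMITER) -> Iterator[str]:
    """Yield one document at a time from a delimiter-separated text stream."""
    lines = []
    for line in stream:
        if line.strip() == delimiter:
            doc = "".join(lines).strip()
            if doc:  # skip empty documents
                yield doc
            lines = []
        else:
            lines.append(line)
    # Emit any trailing text that is not followed by a delimiter.
    doc = "".join(lines).strip()
    if doc:
        yield doc


# Tiny in-memory stand-in for the real file, for demonstration only.
sample = io.StringIO(
    "First article.\n"
    "---END.OF.DOCUMENT---\n"
    "Second article,\n"
    "two lines.\n"
    "---END.OF.DOCUMENT---\n"
)
docs = list(iter_documents(sample))
word_count = sum(len(d.split()) for d in docs)  # whitespace-delimited tokens
```

For the real corpus, the same function could be given a file object opened on the decompressed file (for example `open(path, encoding="utf-8")`, where the path and encoding are again assumptions), so that only one document is held in memory at a time.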

As always, it is available as a direct download for those on Internet2. For normal Internet connections, we offer a BitTorrent download. If you use the BitTorrent download, please help us synchronize the swarm by starting your download today, and leave your BitTorrent program running for a few days after the download completes. This will help others download the file, and will create some good karma for you.

Your feedback is welcome and appreciated,

Cyrus

--
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}


