* Corpus of Contemporary American English<http://corpus.byu.edu/coca/> (COCA). 440 million words of downloadable text (190,000 separate texts). Balanced for genre — about 88 million words each of spoken, fiction, magazine, newspaper, and academic. With the included [sources] table, you can also search by sub-genre, e.g. News-Financial or Academic-Medicine.
* The Corpus of Global Web-Based English<http://corpus2.byu.edu/glowbe/> (GloWbE). 1.8 billion words of downloadable text (1,800,000 separate texts). Divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, so it captures very informal language.
Of course, with the full-text data from either corpus, you will have the actual corpora on your own computer. As a result, you can do many things that would be difficult or impossible with the standard web interface<http://corpus.byu.edu/>, such as sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and so on.
The data comes in three different formats<http://corpus.byu.edu/full-text/formats.asp> (see samples<http://corpus.byu.edu/full-text/samples.asp>): data for relational databases (info<http://corpus.byu.edu/full-text/database.asp>), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data<http://corpus.byu.edu/full-text/purchase.asp>, you purchase the rights to any and all of these formats.
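As a rough illustration of working with the word/lemma/PoS (vertical) format, here is a minimal Python sketch. It assumes one token per line with three tab-separated columns (word, lemma, PoS tag); the actual column layout, the tag set, and the sample tokens below are assumptions for illustration only, so check the samples page for the real layout. The function names are hypothetical.

```python
import io
from collections import Counter


def read_vertical(stream):
    """Yield (word, lemma, pos) tuples from tab-separated lines.

    Assumes three columns per line; lines with any other
    number of columns are skipped.
    """
    for line in stream:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield tuple(parts)


def lemma_counts(tokens, pos_prefix):
    """Count lemmas whose PoS tag starts with the given prefix."""
    return Counter(
        lemma for _, lemma, pos in tokens if pos.startswith(pos_prefix)
    )


# Made-up sample in the assumed layout (not real corpus data).
sample = (
    "The\tthe\tat\n"
    "corpora\tcorpus\tnn2\n"
    "are\tbe\tvbr\n"
    "large\tlarge\tjj\n"
    "corpus\tcorpus\tnn1\n"
)

counts = lemma_counts(read_vertical(io.StringIO(sample)), "nn")
print(counts.most_common())
```

For a real file, the io.StringIO object would be replaced with an opened file handle; because tokens are read lazily, the full text never has to fit in memory at once.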