[Corpora-List] Now available: Downloadable COCA and GloWbE full-text corpus data

Mark Davies Mark_Davies at byu.edu
Mon Mar 17 14:44:58 CET 2014

At http://corpus.byu.edu/full-text/ you can now download full-text data for the following two corpora:

* Corpus of Contemporary American English<http://corpus.byu.edu/coca/> (COCA). 440 million words of downloadable text (190,000 separate texts). Balanced for genre: about 88 million words each of spoken, fiction, magazine, newspaper, and academic. With the included [sources] table, you can also search by sub-genre, e.g. News-Financial or Academic-Medicine.

* Corpus of Global Web-Based English<http://corpus2.byu.edu/glowbe/> (GloWbE). 1.8 billion words of downloadable text (1,800,000 separate texts). Divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, which provide very informal language.

Of course, with the full-text data from either corpus, you will have the actual corpora on your computer. As a result, you can do many things that would be difficult or impossible with the standard web interface<http://corpus.byu.edu/>, such as sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and so on.
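
As a rough illustration of the kind of regex search that becomes possible with local text, here is a minimal Python sketch. The sample sentence, the pattern, and the function name find_end_up_ving are all made up for illustration; they are not taken from the corpus data, and the real linear-text files may be tokenized differently.

```python
import re

# Hypothetical sample standing in for a stretch of linear (horizontal) text.
# The real files may differ in tokenization and markup.
sample_text = (
    "She ended up moving to Ohio . "
    "They end up regretting it . "
    "He ends up at home ."
)

# Any form of "end up" followed by a word ending in -ing.
pattern = re.compile(r"\bend(?:s|ed|ing)?\s+up\s+(\w+ing)\b", re.IGNORECASE)


def find_end_up_ving(text):
    """Return the -ing words that follow a form of 'end up' in text."""
    return [match.group(1) for match in pattern.finditer(text)]


hits = find_end_up_ving(sample_text)
print(hits)
```

For a full corpus you would read the downloaded files one at a time and apply the same pattern to each, rather than loading everything into memory at once.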

The data comes in three different formats<http://corpus.byu.edu/full-text/formats.asp> (see samples<http://corpus.byu.edu/full-text/samples.asp>): data for relational databases (info<http://corpus.byu.edu/full-text/database.asp>), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data<http://corpus.byu.edu/full-text/purchase.asp>, you purchase the rights to any and all of these formats.
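
To give a sense of how the word/lemma/PoS (vertical) format might be used, here is a minimal Python sketch. It assumes one token per line with three tab-separated columns; that layout, the sample rows, the tag values, and the function name parse_vertical are assumptions for illustration only, so please check the samples page for the actual column layout before relying on it.

```python
from collections import Counter

# Hypothetical sample: one token per line, tab-separated word / lemma / PoS.
# The rows and tags are illustrative, not copied from the corpus.
sample = (
    "The\tthe\tat\n"
    "corpora\tcorpus\tnn2\n"
    "were\tbe\tvbdr\n"
    "downloaded\tdownload\tvvn\n"
    ".\t.\ty\n"
)


def parse_vertical(text):
    """Parse vertical-format text into (word, lemma, pos) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Skip blank lines or anything that is not a three-column row.
            continue
        rows.append(tuple(parts))
    return rows


tokens = parse_vertical(sample)

# Count lemmas, e.g. as a first step toward a frequency list.
lemma_counts = Counter(lemma for _, lemma, _ in tokens)
print(lemma_counts.most_common(3))
```

The same tuples could be grouped by the PoS column instead, for example to pull out all nouns or all verbs.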


Mark Davies


