[Corpora-List] The new 14 billion word iWeb corpus (from the BYU corpora)

Mark Davies Mark_Davies at byu.edu
Thu May 17 15:19:21 CEST 2018


We have just released the new 14 billion word iWeb corpus <https://corpus.byu.edu/iweb/>, which complements other BYU corpora <https://corpus.byu.edu/> such as COCA, COHA, NOW, BYU-BNC, GloWbE, Wikipedia, and EEBO.

At 14 billion words, iWeb is more than 25 times as large as the 560 million word COCA corpus. iWeb also has a much wider range of web-based materials than does COCA, since it is based on 22 million web pages in nearly 100,000 carefully selected websites (based on Alexa.com <https://www.alexa.com/topsites>, from Amazon).
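As a rough check on the scale described above, the rounded figures from this message can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, the inputs are the rounded numbers quoted here rather than exact corpus counts, and "nearly 100,000" websites is treated as exactly 100,000, so the pages-per-site value is a lower bound.

```python
# Rounded figures quoted in the announcement (not exact corpus counts).
IWEB_WORDS = 14_000_000_000   # "14 billion words"
COCA_WORDS = 560_000_000      # "560 million word COCA corpus"
IWEB_PAGES = 22_000_000       # "22 million web pages"
IWEB_SITES = 100_000          # "nearly 100,000" websites, taken as an upper bound


def size_ratio(larger: int, smaller: int) -> float:
    """Return how many times larger the first corpus is than the second."""
    return larger / smaller


# Size of iWeb relative to COCA, using the rounded word counts.
ratio = size_ratio(IWEB_WORDS, COCA_WORDS)

# Average pages per website; a lower bound, because the true number of
# sites is somewhat below 100,000.
pages_per_site = IWEB_PAGES / IWEB_SITES

# Average words per web page, again from rounded inputs.
words_per_page = IWEB_WORDS / IWEB_PAGES

print(f"iWeb / COCA size ratio: {ratio:.1f}")
print(f"Pages per website (at least): {pages_per_site:.0f}")
print(f"Words per page (approx.): {words_per_page:.0f}")
```

With these rounded inputs the ratio comes out at 25, and each website contributes at least a couple of hundred pages on average, which gives a sense of why the range of materials is so much wider than in COCA.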

New in iWeb is the ability to browse through the top 60,000 words in the corpus, and to search this list by word form, part of speech, rank (#1-60,000), and even pronunciation.

Most importantly, you can then see detailed information on each of the top 60,000 words in the corpus – definition, frequency information, synonyms and other related words (from WordNet, word families, MRC, etc.), collocates (in a much improved format), related “topics” (perhaps much more useful than collocates), “clusters” (new in iWeb), relevant websites, and sample concordance/KWIC lines. There are extensive hyperlinks on each page, which allow you to move quickly and easily from one word to a number of related words.

In addition, for each of these 60,000 words, there are “quick links” to related data from other websites – pronunciation, additional definitions, images, videos, and translations (for more than 100 languages).

iWeb also allows you to quickly and easily create “virtual corpora” on nearly any topic, and these virtual corpora can then be searched as their own “stand-alone” corpora, or compared to other virtual corpora that you have created.

Finally, in terms of “standard” corpus searches, we note that (due to improvements in the corpus architecture) iWeb is faster than any of the other BYU corpora, and in most cases it is also much faster than other large, 10-20 billion word online corpora.

For a short overview of the corpus (in graphical format, with an emphasis on the new features), please see:

https://corpus.byu.edu/iweb/help/iweb_overview.pdf

We hope that this new corpus is useful to you in your teaching, learning, and research.

Best,

Mark Davies

BYU Corpora

============================================

Mark Davies

Professor of Linguistics / Brigham Young University

http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **

** Historical linguistics // Language variation **

** English, Spanish, and Portuguese **

============================================
