[Corpora-List] the ebb and flow of inclusion of words in OED?

Mark Davies Mark_Davies at byu.edu
Tue Apr 26 19:53:22 CEST 2011

Martin Mueller wrote:

>> A much better source, to pick up from John Sowa's suggestion, would be the 30,000 EEBO texts that have been transcribed and the 40,000 that will be transcribed over the next four years. Do lemmatization and morphosyntactic analysis for every word and think of the combination of lemma and POS tag as an abstract entity whose orthographic manifestations can be put on a time line.

This could be done quite easily with the 400-million-word Corpus of Historical American English (COHA, http://corpus.byu.edu/coha), which contains 100,000 texts from fiction, popular magazines, newspapers, and other non-fiction. On the back end (in the relational database), one could find, for example, all adjectives that occur at least three times in decade X (e.g. the 1920s) but don't occur in any of the preceding decades (e.g. 1810s-1910s), and repeat this for each of the 20 decades in the corpus to see the number of "new words" per decade.
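A rough sketch of that per-decade query, written in Python rather than against the actual database. The row layout (lemma, POS tag, decade, count), the "adj" tag label, and the function name are all assumptions made for illustration; they are not the real COHA schema or the CLAWS tagset.

```python
from collections import defaultdict


def new_words_per_decade(rows, pos="adj", min_freq=3):
    """Return {decade: sorted list of lemmas first attested in that decade}.

    rows is an iterable of (lemma, pos_tag, decade, count) tuples.
    This layout is hypothetical, not the actual COHA table structure.
    A lemma counts as "new" in a decade if it occurs at least min_freq
    times there and does not occur at all in any earlier decade.
    """
    # freq[decade][lemma] = total occurrences of that lemma in that decade
    freq = defaultdict(lambda: defaultdict(int))
    for lemma, tag, decade, count in rows:
        if tag == pos and count > 0:
            freq[decade][lemma] += count

    seen = set()  # every lemma attested in any decade processed so far
    result = {}
    for decade in sorted(freq):
        current = freq[decade]
        result[decade] = sorted(
            lemma
            for lemma, n in current.items()
            if n >= min_freq and lemma not in seen
        )
        # Even lemmas below the threshold count as attested from now on.
        seen.update(current)
    return result


# Toy data, invented purely to exercise the function.
rows = [
    ("old", "adj", 1910, 5),
    ("old", "adj", 1920, 7),
    ("jazzy", "adj", 1920, 4),
    ("snazzy", "adj", 1920, 2),   # below threshold, but now attested
    ("snazzy", "adj", 1930, 6),   # not new: already seen in the 1920s
    ("radio", "noun", 1920, 9),   # wrong part of speech, ignored
    ("televised", "adj", 1930, 3),
]
new_words = new_words_per_decade(rows)
```

Note that the earliest decade has no predecessors, so everything above the threshold there looks "new"; in practice one would probably report results only from the second decade onward.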

Compared to the OED, COHA has the advantages that 1) it's more than 10 times as large (400 million vs 37 million words in the OED "corpus" of 2.2 million quotations), and 2) it is tagged and lemmatized (using CLAWS). The downside to COHA is that it covers only 1810-2009.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
