[Corpora-List] the ebb and flow of inclusion of words in OED?

Martin Mueller martin.mueller at mac.com
Tue Apr 26 16:41:42 CEST 2011


The coverage of English before 1900 in the OED is largely the work of Victorian and Edwardian editors, including a lot of amateur clerical crowdsourcers.

If you look at the timeline from a distance, you may conclude that it reflects the literary tastes of the editors, for whom the late 16th and early 17th centuries were a golden age. Then you see a lot of 19th century stuff, reflecting the industrial revolution and all the words that science and technology brought with them.

The OED is probably not a particularly good source for studying the growth of vocabulary. A much better source, to pick up on John Sowa's suggestion, would be the 30,000 EEBO texts that have been transcribed and the 40,000 that will be transcribed over the next four years. Do lemmatization and morphosyntactic analysis for every word and think of the combination of lemma and POS tag as an abstract entity whose orthographic manifestations can be put on a time line. Complicate the picture by using the English Short Title Catalogue to produce granular forms of text classification so that words sit in a space of time and genre.
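
To make the lemma-plus-POS idea concrete, here is a minimal sketch in Python. It assumes that lemmatization and tagging have already been done upstream and that each token carries a publication year and a genre label. The Token record, the function names, the tagset, and the toy data are all invented for illustration; none of them come from EEBO, the ESTC, or any existing tool.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """One word occurrence; every field is assumed to come from upstream tools."""
    year: int      # publication year of the text (e.g. from catalogue metadata)
    genre: str     # coarse genre label (hypothetical classification)
    spelling: str  # orthographic form as transcribed
    lemma: str     # lemmatizer output
    pos: str       # part-of-speech tag (illustrative tagset)


def time_bin(year: int, width: int = 10) -> int:
    """Start year of the bin containing `year`; width=10 gives decades."""
    return year - year % width


def build_timeline(tokens, width=10):
    """Map (lemma, pos) -> (bin, genre) -> Counter of spellings."""
    timeline = defaultdict(lambda: defaultdict(Counter))
    for t in tokens:
        cell = (time_bin(t.year, width), t.genre)
        timeline[(t.lemma, t.pos)][cell][t.spelling] += 1
    return timeline


def first_attested(tokens):
    """Earliest year in which each (lemma, pos) entity occurs in the data."""
    first = {}
    for t in tokens:
        key = (t.lemma, t.pos)
        if key not in first or t.year < first[key]:
            first[key] = t.year
    return first


def new_entities_per_bin(tokens, width=10):
    """Count entities by the bin of their first attestation."""
    return Counter(time_bin(y, width) for y in first_attested(tokens).values())


# Toy data, invented purely for illustration.
TOKENS = [
    Token(1593, "drama", "loue", "love", "n"),
    Token(1593, "drama", "loue", "love", "v"),
    Token(1598, "drama", "looke", "look", "v"),
    Token(1604, "drama", "lookes", "look", "v"),
    Token(1611, "sermon", "love", "love", "n"),
    Token(1652, "science", "electricity", "electricity", "n"),
]

timeline = build_timeline(TOKENS)
growth = new_entities_per_bin(TOKENS)
```

Keying on (lemma, POS) rather than on spelling keeps "loue" and "love" together as one entity while still recording which spelling shows up in which decade and genre. Counts of first attestations per bin measure only what the corpus happens to contain, so they inherit the same accidents of survival and selection discussed in this thread.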

Some of that work is going on in Project Bamboo, but there is room for many helping hands.

On 4/26/11 8:51 AM, "John F. Sowa" <sowa at bestweb.net> wrote:


>On 4/25/2011 5:12 PM, chris brew wrote:
>> I think part of the 1600 bump must correspond to William Shakespeare
>> (1564-1616, first folio published 1623, second folio published 1632)
>> and that a corresponding bump from 1380-1400 corresponds to Chaucer (you
>> have to set the granularity to 10 years to see it clearly)
>>
>> Something else happened in the 1650-1659 decade. I have a plausible
>> hypothesis but no more...
>
>Those are interesting hypotheses about the effects of literature
>and the methods of recording, distribution, and preservation.
>
>Some of those effects are probably distorted by historical accidents
>of loss and preservation. But the decisions of editors about which
>sources to consider would also influence the results.
>
>Ted Pedersen:
>> ... there are local peaks around the years 1400, 1600, and 1900,
>> with valleys around 1500, 1750, and the present day.
>
>I can't believe that the present day with the huge expansion
>of the WWW is a true valley. And the valley around 1750 was
>a period of active colonization that may have produced many
>words that weren't recorded in the OED sources.
>
>It would be interesting to do a more detailed study of word
>creation and disuse by going back to the original documents,
>when more of them become digitized.
>
>John Sowa


