[Corpora-List] the ebb and flow of inclusion of words in OED?

Martin Reynaert reynaert at uvt.nl
Tue Apr 26 16:28:23 CEST 2011



> It would be interesting to do a more detailed study of word
> creation and disuse by going back to the original documents,
> when more of them become digitized.
>
> John Sowa
>
Dear John,

Just as a general note of warning on this... The examples are Dutch, but sobering nevertheless.

The Dutch National Library is putting online 8 million pages of digitized newspapers. It is a delightful collection going back to 1618, available for free to all.

If you go to

http://kranten.kb.nl/

and type in the query 'atoomschip' or even 'atoomtram' you will get a range of hits from about 1900 to 1927. The terms translate as 'nuclear ship' and 'nuclear tram'...

These are, of course, simple 's'-to-'a' OCR misrecognition errors, steam ships ('stoomschepen') and steam trams ('stoomtrams') being common at the time ;0) They are also examples of real-world real-word errors, far more interesting than the 20 or so word confusion sets used in most research on context-sensitive spelling correction.
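
As a rough illustration of how such a misreading could be traced back to the intended word, here is a minimal Python sketch. It assumes a small hand-made lexicon and a single confusion pair (true 's' read as 'a'); the function name ocr_candidates and the toy data are hypothetical and only meant to show the idea, not how any actual OCR post-correction system works.

```python
def ocr_candidates(observed, lexicon, confusions):
    """Return lexicon words obtained from `observed` by undoing one OCR confusion.

    `confusions` is a list of (true_char, misread_char) pairs, e.g. ("s", "a")
    for a printed 's' that the OCR engine read as 'a'. Illustrative only.
    """
    candidates = set()
    for true_char, misread_char in confusions:
        for i, ch in enumerate(observed):
            if ch == misread_char:
                # Undo the confusion at this one position and check the lexicon.
                restored = observed[:i] + true_char + observed[i + 1:]
                if restored in lexicon:
                    candidates.add(restored)
    return sorted(candidates)


# Toy lexicon of period vocabulary (assumed for the example).
lexicon = {"stoomschip", "stoomschepen", "stoomtram", "stoomtrams"}
confusions = [("s", "a")]

for word in ("atoomschip", "atoomtram"):
    print(word, "->", ocr_candidates(word, lexicon, confusions))
```

Note that a plain dictionary lookup would never flag 'atoomschip' in a modern lexicon, since it is a valid word today; only the date of the source or the surrounding context gives the error away.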

Yours,

Martin


