The lack of a standard corpus for evaluating English spellcheckers may not be all that surprising. Researchers focus on different aspects of spellchecking, and a corpus appropriate for testing one piece of work may be almost useless for testing another. Are we concentrating on detecting errors, or can we take the error as given and concentrate on suggesting corrections? Are we happy to ignore real-word errors or, on the other hand, are they the focus of our research? Do we want to tackle the sort of misspellings made by people who have difficulty with spelling, or are we correcting the occasional typo in otherwise correct text? Are we interested, perhaps exclusively, in OCR errors? Does it matter whether the errors are made by native speakers or by second-language users of English? Is a set of errors (with their targets) adequate, or is it essential to have context? If the latter, will a snippet of context do, or do we need full documents? Do we want to correct running text or queries to a search engine? And so on.

My own work has focussed on trying to correct the mangled efforts of poor spellers. Years ago, I gathered various collections of misspellings and deposited them, with some documentation, in the Oxford Text Archive, who christened them the "Birkbeck error corpus". A file derived from these is available for download from my website, along with a couple of others:


More recently, my colleague Jenny Pedler has compiled a file specifically of real-word errors, in some context. This is also available for download:


