[Corpora-List] Spellchecker evaluation corpus

Roger Mitton roger at dcs.bbk.ac.uk
Wed Apr 13 20:14:18 CEST 2011

A late response to Stefan's posting of 9 Apr.

The lack of a standard corpus for evaluating English spellcheckers may not be all that surprising. Researchers focus on different aspects of spellchecking, and a corpus appropriate for testing one piece of work may be almost useless for testing another. Are we concentrating on detecting errors or can we take the error as given and concentrate on suggesting corrections? Are we happy to ignore real-word errors or, on the other hand, are they the focus of our research? Do we want to tackle the sort of misspellings made by people who have difficulty with spelling or are we correcting the occasional typo in otherwise correct text? Are we interested, perhaps exclusively, in OCR errors? Does it matter if the errors are made by native speakers or second-language users of English? Is a set of errors (with their targets) adequate or is it essential to have context? If the latter, will a snippet of context do or do we need full documents? Do we want to correct running text or queries to a search engine? And so on.
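To make a couple of these distinctions concrete, here is a minimal sketch of scoring a checker on a plain list of errors with their targets, counting detection and correction separately. Everything in it is an assumption for illustration: the tiny dictionary, the three error/target pairs, and the naive edit-distance ranking are invented here and are not taken from any of the corpora mentioned in this message.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy dictionary and error/target pairs, invented for this sketch.
DICTIONARY: Set[str] = {"from", "form", "necessary", "receive", "their", "there"}
PAIRS: List[Tuple[str, str]] = [
    ("recieve", "receive"),      # non-word error
    ("neccesary", "necessary"),  # non-word error
    ("form", "from"),            # real-word error: "form" is itself a word
]


def suggest(word: str) -> List[str]:
    """Naive corrector: rank every dictionary word by edit distance, then alphabetically."""
    return sorted(DICTIONARY, key=lambda w: (edit_distance(word, w), w))


def evaluate(pairs: Iterable[Tuple[str, str]],
             dictionary: Set[str],
             suggester: Callable[[str], List[str]],
             top_n: int = 3) -> Dict[str, int]:
    """Count detected errors, targets ranked first, and targets within the top n."""
    total = detected = first = in_top_n = 0
    for error, target in pairs:
        total += 1
        # Dictionary lookup only detects non-words; without context,
        # a real-word error passes as correct.
        if error in dictionary:
            continue
        detected += 1
        suggestions = suggester(error)[:top_n]
        if suggestions and suggestions[0] == target:
            first += 1
        if target in suggestions:
            in_top_n += 1
    return {"total": total, "detected": detected, "first": first, "in_top_n": in_top_n}


result = evaluate(PAIRS, DICTIONARY, suggest)
print(result)  # {'total': 3, 'detected': 2, 'first': 2, 'in_top_n': 2}
```

On this toy input the real-word error goes undetected because no context is available, which is the kind of case a bare list of errors with targets cannot exercise.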

My own work has focussed on trying to correct the mangled efforts of poor spellers. Years ago, I gathered various collections of misspellings and deposited them, with some documentation, in the Oxford Text Archive, who christened them the "Birkbeck error corpus". There is a file, derived from these, for download from my website, along with a couple of others:


More recently, my colleague Jenny Pedler has compiled a file specifically of real-word errors, in some context. This is also available for download:


Roger Mitton
Birkbeck, University of London

On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag <sbordag at informatik.uni-leipzig.de> wrote:

> Hi everyone,
> It seems like for every conceivable NLP task there is some agreed-upon
> evaluation data set. Or at least one that is used in at least several
> papers. Now, for some strange reason I seem to be utterly unable to find any
> such test set for the spell checking task!
> Am I doing something wrong or is there no such data set? I know I can make
> synthetic tests systematically inserting, swapping etc. letters in my own
> test data, but this would give me results which I cannot compare to any
> other results. Hence, is there some accepted evaluation forum which I am
> missing because whenever I include spell check in any form in search queries
> I get lots of tutorials how to write a spellchecker and almost nothing
> else...
> Best regards,
> Stefan Bordag
