[Corpora-List] Spellchecker evaluation corpus

Stefan Bordag sbordag at informatik.uni-leipzig.de
Thu Apr 14 08:42:15 CEST 2011

Hi Roger,

Thanks for this valuable input.

I imagine, however, that it wouldn't be conceptually difficult to set up a test that covers most or all of the needs you mentioned. A proper evaluation setup for spellchecking in general would consist of:

- several collections of misspelled words, each with a defined amount of context, in different languages, to evaluate spelling error detectors and correctors
- collections differing by source of error: errors made by dyslectics are different from errors introduced by OCR, which are in turn different from errors introduced by writing SMS, which are different from errors introduced by beginners trying to write scientific papers, etc.
- several collections of string pairs (wrong to correct) in several different languages to evaluate context-free spelling correction algorithms (though the previous collections could be used for that as well)
- it should distinguish between spell checkers that need training data to learn to properly detect or correct errors and those that don't need any explicit training data (such as the Lucene spell checker)
- it should also take resource usage into account: the Lucene spell checker is much more memory intensive than a simple edit distance searcher, which, however, might use more CPU time
- the languages covered should come from different language families, and should include languages with non-concatenative morphology as well as languages such as Chinese
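To make the context-free case concrete, here is a minimal sketch in Python of how a collection of (wrong, correct) string pairs could be used to score a corrector. Everything in it is an assumption for illustration: the tiny lexicon, the four test pairs, and the names `edit_distance`, `make_corrector` and `accuracy` are made up, and the corrector is just a naive "simple edit distance searcher", not any particular existing tool.

```python
from typing import Callable, Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def make_corrector(lexicon: Iterable[str]) -> Callable[[str], str]:
    """Return a corrector that picks the lexicon word closest to the input."""
    words: List[str] = sorted(lexicon)  # sorted so that ties break predictably

    def correct(word: str) -> str:
        return min(words, key=lambda w: edit_distance(word, w))

    return correct


def accuracy(correct: Callable[[str], str],
             pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of misspellings whose first suggestion is the intended word."""
    pairs = list(pairs)
    hits = sum(1 for wrong, right in pairs if correct(wrong) == right)
    return hits / len(pairs)


# Hypothetical toy data; a real collection would hold thousands of pairs
# per language and per error source.
LEXICON = {"receive", "separate", "definitely", "their", "spelling"}
PAIRS = [("recieve", "receive"), ("seperate", "separate"),
         ("definately", "definitely"), ("thier", "their")]

if __name__ == "__main__":
    print(accuracy(make_corrector(LEXICON), PAIRS))
```

The same `accuracy` function could be run once per collection (OCR errors, SMS errors, and so on) so that results are reported per error source rather than as a single number.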

I would wager that once such a well-rounded set of collections covering these different aspects has been assembled, it would very well be possible to make absolute statements about which algorithm covers which areas to which extent.

Additionally, with today's abundance of internet bandwidth and CPU resources, it shouldn't be difficult to set up an evaluation web service against which the author of some new algorithm can test it. This way the evaluation instance wouldn't even have to make the data freely available as such. Not all of it, anyway. Quite similar to the Microsoft spell checker competition which has been mentioned here, but without the legal regulations that make them the owner of your algorithm once you want to participate...
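A rough sketch of the core of such a service, leaving out the network layer entirely: the evaluator hands out only the misspellings and returns only aggregate scores, so the target spellings never leave the server. The class name `HiddenGoldEvaluator`, its methods and the item-id scheme are all hypothetical, chosen for this illustration only.

```python
from typing import Dict, Tuple


class HiddenGoldEvaluator:
    """Scores submissions against gold targets that are never exposed."""

    def __init__(self, gold: Dict[str, Tuple[str, str]]) -> None:
        # gold maps an item id to (misspelling, intended word)
        self._gold = dict(gold)

    def inputs(self) -> Dict[str, str]:
        """Public part of the data: item id -> misspelling only."""
        return {item_id: wrong for item_id, (wrong, _) in self._gold.items()}

    def score(self, submission: Dict[str, str]) -> Dict[str, float]:
        """Return aggregate figures only, never per-item gold answers."""
        answered = [i for i in submission if i in self._gold]
        hits = sum(1 for i in answered if submission[i] == self._gold[i][1])
        total = len(self._gold)
        return {
            "accuracy": hits / total if total else 0.0,
            "answered": len(answered),
            "total": total,
        }


if __name__ == "__main__":
    # Hypothetical two-item collection.
    evaluator = HiddenGoldEvaluator({"1": ("teh", "the"), "2": ("wrod", "word")})
    print(evaluator.inputs())
    print(evaluator.score({"1": "the", "2": "wood"}))
```

Unanswered items count as wrong here, which is a design choice; a real service might also want to limit the number of submissions so that the hidden targets cannot be recovered by trial and error.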

A very similar approach taken by the Morpho Challenge [1] has helped to discover (among many other things) that some algorithms, while producing excellent results in English, might really fail in Turkish, for example.

Best regards, Stefan

[1] http://research.ics.tkk.fi/events/morphochallenge2010/

On 13.04.2011 20:14, Roger Mitton wrote:
> A late response to Stefan's posting of 9 Apr.
> The lack of a standard corpus for evaluating English spellcheckers may not be
> all that surprising. Researchers focus on different aspects of spellchecking,
> and a corpus appropriate for testing one piece of work may be almost useless for
> testing another. Are we concentrating on detecting errors or can we take the
> error as given and concentrate on suggesting corrections? Are we happy to ignore
> real-word errors or, on the other hand, are they the focus of our research? Do
> we want to tackle the sort of misspellings made by people who have difficulty
> with spelling or are we correcting the occasional typo in otherwise correct
> text? Are we interested, perhaps exclusively, in OCR errors? Does it matter if
> the errors are made by native speakers or second-language users of English? Is a
> set of errors (with their targets) adequate or is it essential to have context?
> If the latter, will a snippet of context do or do you need full documents? Do
> we want to correct running text or queries to a search engine? And so on.
> My own work has focussed on trying to correct the mangled efforts of poor
> spellers. Years ago, I gathered various collections of misspellings and
> deposited them, with some documentation, in the Oxford Text Archive, who
> christened them the "Birkbeck error corpus". There is a file, derived from
> these, for download from my website, along with a couple of others:
> http://www.dcs.bbk.ac.uk/~roger/corpora.html
> More recently, my colleague Jenny Pedler has compiled a file specifically of
> real-word errors, in some context. This is also available for download:
> http://www.dcs.bbk.ac.uk/~jenny/resources.html
> Roger Mitton
> Birkbeck, University of London
> On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag
> <sbordag at informatik.uni-leipzig.de> wrote:

--
Dr. Stefan Bordag
0341 49 26 196
sbordag at informatik.uni-leipzig.de
