[Corpora-List] Spellchecker evaluation corpus

Martin Reynaert reynaert at uvt.nl
Mon Apr 11 16:15:22 CEST 2011

Dear list,

We have so far heard interesting news about German and English spelling correction benchmarks. Thank you Yannick, Eric and John! And, of course, Stefan for bringing up this remarkable state of affairs!

About two years ago I contacted Dr. A. Zamora to enquire whether the list of 50,000 English typos and their corrected forms, which he collected together with Dr. J.J. Pollock in about 1983, could perhaps be made available. He informed me that it is 'lost in the mists of time'...

Another missed chance, I presume, was when the BNC was rejuvenated and 'corrected' a couple of years ago. I sent in a list of over 3,000 typos and won a personal copy of the new XML version. Given that version and the original one, it might yet be possible to quickly derive a nice benchmark for English...

In the context of search-engine query spelling correction, Bing and Microsoft Research currently have a challenge running (cf. http://web-ngram.research.microsoft.com/spellerchallenge/ ). A large training data set is provided. The test set, however, is not going to be released. For systems that require training, the MS training data might be used in 10-fold cross-validation experiments, with each fold in turn being left out of training and used for testing. This would be another option open to us at this stage.
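To make the 10-fold idea concrete, here is a minimal sketch of how a list of (error, correction) pairs could be split so that each fold serves as the test set exactly once. The function name, the seed and the toy data are my own assumptions for illustration; nothing here reflects the actual format of the MS training data.

```python
import random


def k_fold_splits(pairs, k=10, seed=0):
    """Yield (train, test) lists; each fold is the test set exactly once.

    `pairs` is any sequence of (error, correction) tuples. The name and
    signature are hypothetical, chosen for this sketch only.
    """
    items = list(pairs)
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(items)
    # Deal the shuffled items round-robin into k folds.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Hypothetical toy data standing in for real query/correction pairs.
toy_pairs = [(f"mispelled{i}", f"misspelled{i}") for i in range(25)]
splits = list(k_fold_splits(toy_pairs, k=10))
```

A system would then be trained on `train` and scored on `test` for each of the ten splits, with the scores averaged afterwards.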

Trevor: In a 2006 paper I published statistics about typographical errors gathered from the Reuters RCV1 corpus for English (12,094 pairs of errors/attested corrections, over 3,000 of which also occurred in the BNC) and from a corpus of Dutch newspapers (9,152 pairs). Both lists have grown considerably since, as I have on and off put more effort into them. To date, however, these lists are not publicly available due to IPR issues. More about this later.

John's example, taken from the newer version of what used to be more commonly known as the Birkbeck spelling error corpus by Dr. Mitton, shows that this corpus is geared far more towards cognitive errors than towards typographical errors. In fact, this particular pair constitutes what is called a 'confusable', also known as a real-word error. I strongly agree with John that we need several kinds of benchmark sets and wrote about that in an LREC paper in 2008 (available from http://ilk.uvt.nl/publications ).

Another resource that is sometimes used in evaluating spelling correction systems is the list provided by Kevin Atkinson, the maker of Aspell. These are isolated errors paired with their alleged corrections. I have strong doubts about some of these pairings, e.g. *amification corrected as amplification (source: http://aspell.net/test/cur/batch0.tab). The point is that, certainly for larger edit or Levenshtein distances between a non-word and its correct form (distance 2 in this particular example), one needs to have the context the error appeared in. Even at distance 1 the correction can be ambiguous, as shown by an example from my PhD work (available from: http://ilk.uvt.nl/~mre/), where the non-word *onjections might have to be resolved to either 'injections' or 'objections'.

(This is a constructed, laboratory sentence:)

Her vehement *onjections to these painful *onjections were based on solid medical evidence, as well as a hearty dislike of needles.
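The distances mentioned above can be checked with a small sketch of the standard dynamic-programming Levenshtein distance (unit cost for insertion, deletion and substitution). The helper `candidates_within` and the three-word lexicon are hypothetical, made up for this illustration only; a real system would of course use a full lexicon.

```python
def levenshtein(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def candidates_within(non_word, lexicon, max_distance):
    """Return the lexicon words within max_distance of non_word (hypothetical helper)."""
    return sorted(w for w in lexicon if levenshtein(non_word, w) <= max_distance)


# Hypothetical toy lexicon, for illustration only.
lexicon = ["injections", "objections", "amplification"]

print(levenshtein("amification", "amplification"))  # 2
print(candidates_within("onjections", lexicon, 1))  # ['injections', 'objections']
```

Both candidates for *onjections lie at distance 1, so the distance alone cannot decide between them; only the context in the sentence above can.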

For English, I intend to someday, perhaps soon, initiate the necessary negotiations with LDC to find a solution for my RCV1 error list...

For Dutch, we at ILK are working on a large benchmark of spelling (and other lexical) errors based on a selection of texts (up to book length, from a variety of text types). IPR issues for these texts have all been settled in the framework of SoNaR, the reference corpus of contemporary written Dutch we are currently building. We will notify the list as soon as this benchmark is available.

In part to help facilitate building this benchmark, we are also currently proposing a new XML format, called FoLiA (Format for Linguistic Annotation). More at: http://ilk.uvt.nl/folia

To conclude, I would like to repeat here what I have been proposing elsewhere, namely that we indeed need shareable benchmark sets, for a range of languages, but that we also need to work towards a consensus regarding the actual evaluation metrics we (should) use.

I would be interested in proposals for collaboration towards building benchmark sets for more languages.

Martin Reynaert ILK TiCC Tilburg University The Netherlands
