All good points, but when you say
> - several collections of misspelled words along with a defined context size of differing languages to evaluate spelling error detectors and correctors
what do you mean by a defined context size? What seems to be missing from your list is what I think should be the ultimate evaluation setting: _full_ texts with _all_ errors annotated.
Error list evaluations cannot measure the false alarm rate or precision of your spelling error detector: how often does it think it has found an error which isn't one? Put another way, an algorithm with great recall or accuracy on an error list may actually be an overenthusiastic system that also flags many normal words as errors.
For fully automatic correction and corpus cleanup this is quite vital - does your method do more harm than good? Interactive spellcheckers could also do with higher precision; the spellchecker is one of the most widely used pieces of language technology worldwide, yet its low precision has not made it particularly loved.
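
To make the difference concrete, here is a small sketch in Python. Everything in it is made up for illustration: the tiny lexicon, the hypothetical `flags_error` detector (which simply flags any token not in the lexicon), and the two toy data sets. It is not meant as a real spellchecker, only as a way to show that the same detector can look perfect on an error list while getting only half of its flags right on a fully annotated text.

```python
# Toy illustration only: the lexicon, detector and data are invented.
LEXICON = {"the", "cat", "sat", "on", "mat", "a", "and", "saw", "dog"}


def flags_error(token):
    """Hypothetical detector: flag every token that is not in the lexicon."""
    return token.lower() not in LEXICON


def evaluate(tokens, gold_errors):
    """Return (precision, recall) of error detection.

    tokens      -- list of word tokens
    gold_errors -- set of token indices annotated as real errors
    """
    flagged = {i for i, tok in enumerate(tokens) if flags_error(tok)}
    true_positives = len(flagged & gold_errors)
    precision = true_positives / len(flagged) if flagged else None
    recall = true_positives / len(gold_errors) if gold_errors else None
    return precision, recall


# Setting 1: an error list. Every item is an error by construction,
# so a false alarm is impossible and precision tells us nothing.
error_list = ["teh", "catt", "matt"]
list_p, list_r = evaluate(error_list, {0, 1, 2})

# Setting 2: a full text with all errors annotated (indices 0 and 5).
# Correct but out-of-lexicon words can now be flagged by mistake.
full_text = "Teh cat sat on the matt and Tilburg saw a dachshund".split()
text_p, text_r = evaluate(full_text, {0, 5})

print(f"error list: precision={list_p:.2f} recall={list_r:.2f}")
print(f"full text:  precision={text_p:.2f} recall={text_r:.2f}")
```

On the error list the detector scores 1.00 on both measures. On the full text recall is still 1.00, but precision drops to 0.50, because "Tilburg" and "dachshund" are flagged even though they are spelled correctly.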
-- Antal van den Bosch Antal.vdnBosch at uvt.nl http://ilk.uvt.nl/~antalb/ ILK / Tilburg center for Cognition and Communication, Tilburg University