[Corpora-List] Spellchecker evaluation corpus

Stefan Bordag sbordag at informatik.uni-leipzig.de
Thu Apr 14 10:48:31 CEST 2011


Dear Antal,

You are right, I didn't think it through to the last consequence. But once you put it like this, perhaps producing such a corpus wouldn't be so difficult after all. Perhaps all it takes is a custom plugin for OpenOffice which people can use when they review documents they write in OO for errors. By simply clicking an accept button provided by the plugin, they would consent to having both the original version and the revised version sent to a database known to the plugin. Over time, a sizeable collection of all sorts of corrections in all sorts of languages could perhaps be produced this way. I certainly would have no problem sending both the uncorrected and the corrected versions of my papers to such a database. After all, it would be one more place where they'd be sort of published. :)
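
To make the idea a little more concrete, here is a minimal sketch of what the database side could do with a submitted pair of versions. It assumes the plugin delivers both versions as plain text; the function name `extract_corrections` and the whitespace tokenisation are illustrative assumptions only, not part of any existing plugin.

```python
import difflib


def extract_corrections(original: str, revised: str):
    """Return (before, after) pairs for token spans that the author replaced.

    Hypothetical helper: tokens are split on whitespace, which is a
    simplification; a real corpus would need proper tokenisation.
    """
    orig_tokens = original.split()
    rev_tokens = revised.split()
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=rev_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only "replace" spans are candidate corrections; pure insertions
        # and deletions are more likely content edits than spelling fixes.
        if tag == "replace":
            pairs.append((" ".join(orig_tokens[i1:i2]),
                          " ".join(rev_tokens[j1:j2])))
    return pairs


pairs = extract_corrections("simply by klicking some button",
                            "simply by clicking some button")
```

Since both full versions are stored, the unchanged words are kept as well, so such a collection would contain full texts and not just a list of errors.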

Best regards, Stefan

On 14.04.2011 10:40, A.P.J. van den Bosch wrote:
> Dear Stefan,
>
> All good points, but when you say
>
>> - several collections of misspelled words along with a defined context size of differing languages to evaluate spelling error detectors and correctors
> what do you mean by a defined context size? What seems to be missing from your list is what I think should be the ultimate evaluation setting: _full_ texts with _all_ errors annotated.
>
> Error list evaluations cannot measure the false alarm rate or precision of your spelling error detector: how often does it think it has found an error which isn't one? Put another way, an algorithm with great recall/accuracy on an error list may actually be an over-enthusiastic system that also flags many normal words as errors.
>
> For fully automatic correction and corpus cleanup this is quite vital: does your method do more harm than good? But interactive spellcheckers could also do with a higher precision; the spellchecker is one of the most widely used pieces of language technology worldwide, yet it is not particularly loved, because of its low precision.
>
> Antal
>
> --
> Antal van den Bosch Antal.vdnBosch at uvt.nl http://ilk.uvt.nl/~antalb/
> ILK / Tilburg center for Cognition and Communication, Tilburg University
>
>
>
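
The distinction Antal draws can be sketched in a few lines. This assumes a fully annotated text in which the positions of all real errors are known; the function name `evaluate_detector` and the representation of errors as sets of token indices are illustrative assumptions, not an established evaluation protocol.

```python
def evaluate_detector(tokens, gold_errors, flagged):
    """Score a spelling error detector on a fully annotated text.

    tokens:      list of all words in the text
    gold_errors: set of token indices annotated as real errors
    flagged:     set of token indices the detector marked as errors
    Returns (precision, recall, false_alarm_rate).
    """
    tp = len(gold_errors & flagged)   # real errors that were flagged
    fp = len(flagged - gold_errors)   # correct words flagged anyway
    fn = len(gold_errors - flagged)   # real errors that were missed
    correct_tokens = len(tokens) - len(gold_errors)
    precision = tp / (tp + fp) if flagged else 0.0
    recall = tp / (tp + fn) if gold_errors else 0.0
    # The false alarm rate needs the correct words of the full text,
    # which is exactly what a bare error list does not provide.
    false_alarm_rate = fp / correct_tokens if correct_tokens else 0.0
    return precision, recall, false_alarm_rate


tokens = "an over-enthousiastic system flags many normal words".split()
precision, recall, false_alarm_rate = evaluate_detector(
    tokens, gold_errors={1}, flagged={1, 4, 5})
```

In this toy case the detector finds the one real error, so its recall is perfect, but two of its three flags are false alarms. An evaluation on an error list alone would only ever see the recall.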

--
Dr. Stefan Bordag
0341 49 26 196
sbordag at informatik.uni-leipzig.de


