[Corpora-List] Spellchecker evaluation corpus

Yannick Versley versley at sfs.uni-tuebingen.de
Sat Apr 9 11:41:24 CEST 2011


Stefan,

The TüBa-D/Z treebank keeps the original spelling in the normal tokens and annotates spelling corrections in the comment field. This means that it can be used to train and test spell checkers (with a suitable split), and that the distribution of errors reflects the actual error rate in edited newspaper text. (It is less typical of the careless writing that you'll find in user-contributed web content, though.)
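To make the train/test idea concrete, here is a minimal sketch in Python. It assumes the (printed token, intended spelling) pairs have already been pulled out of the treebank; how the corrections are encoded in the comment field is not shown here, and the names `split_pairs`, `evaluate` and the example tokens are invented for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# (token as printed, intended spelling); equal when the token is correct.
# Assumption: these pairs were extracted from the corpus beforehand.
Pair = Tuple[str, str]


def split_pairs(pairs: List[Pair], test_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle deterministically and return (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(checker: Callable[[str], str], pairs: List[Pair]) -> Dict[str, float]:
    """Score a checker that maps a token to its proposed spelling."""
    errors = [(o, g) for o, g in pairs if o != g]
    correct = [(o, g) for o, g in pairs if o == g]
    # Misspelled tokens that the checker restores to the intended form.
    fixed = sum(1 for o, g in errors if checker(o) == g)
    # Correct tokens that the checker changes anyway.
    false_alarms = sum(1 for o, _ in correct if checker(o) != o)
    return {
        "n_errors": len(errors),
        "correction_accuracy": fixed / len(errors) if errors else 0.0,
        "false_alarm_rate": false_alarms / len(correct) if correct else 0.0,
    }
```

Counting false alarms on correct tokens matters here because in edited newspaper text the vast majority of tokens are correct, so a checker that over-corrects can do more harm than good.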

Best, Yannick

On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag <sbordag at informatik.uni-leipzig.de> wrote:


> Hi everyone,
>
> It seems that for every conceivable NLP task there is some agreed-upon
> evaluation data set, or at least one that has been used in several
> papers. Yet for some strange reason I seem to be utterly unable to find any
> such test set for the spell checking task!
>
> Am I doing something wrong, or is there no such data set? I know I can make
> synthetic tests by systematically inserting, swapping, etc. letters in my own
> test data, but this would give me results that I cannot compare to any
> other results. Hence: is there some accepted evaluation forum that I am
> missing? Whenever I include "spell check" in any form in search queries,
> I get lots of tutorials on how to write a spellchecker and almost nothing
> else...
>
> Best regards,
> Stefan Bordag
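
For reference, the synthetic test data mentioned in the quoted message (systematically inserting or swapping letters) could be generated roughly as follows. This is only a sketch: the function name `corrupt`, the four edit operations and the lowercase ASCII alphabet are assumptions, and German text would need an alphabet that includes umlauts and ß.

```python
import random
import string


def corrupt(word: str, rng: random.Random,
            alphabet: str = string.ascii_lowercase) -> str:
    """Apply one random single-character edit to word.

    The result can occasionally equal the input, for example when a
    substitution picks the same letter or two identical letters are swapped.
    """
    ops = ["insert"]
    if len(word) >= 1:
        ops.append("substitute")
    if len(word) >= 2:
        # Deleting from a one-letter word would leave nothing, so skip it.
        ops += ["delete", "transpose"]
    op = rng.choice(ops)

    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    if op == "substitute":
        i = rng.randrange(len(word))
        return word[:i] + rng.choice(alphabet) + word[i + 1:]
    if op == "delete":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    # transpose: swap the letters at positions i and i + 1
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

Passing a seeded `random.Random` keeps the generated test set reproducible, although, as the quoted message points out, the resulting scores are still not comparable with numbers published on other data.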