[Corpora-List] Spellchecker evaluation corpus

Trevor Jenkins trevor.jenkins at suneidesis.com
Sat Apr 9 13:47:40 CEST 2011


On Sat, 9 Apr 2011, Stefan Bordag <sbordag at informatik.uni-leipzig.de> wrote:


> It seems like for every conceivable NLP task there is some agreed-upon
> evaluation data set. Or at least one that is used in at least several
> papers. Now, for some strange reason I seem to be utterly unable to find
> any such test set for the spell checking task!
>
> Am I doing something wrong or is there no such data set? ...

I don't know of one per se, other than my own dyslexically-motivated scribblings, though I am heartened to see that others have responded with a few datasets. But which language are you considering? And, equally, from what period? The orthography of English has changed over the years: there are celebrations organised this year for the 400th anniversary of the King James Version translation of the Bible, and spelling conventions then were different from today's. Similarly, the conventions of 200 years ago, in Jane Austen's time, were different again. English literature students should be familiar with both of those.


> ... I know I can make synthetic tests systematically inserting, swapping
> etc. letters in my own test data, but this would give me results which I
> cannot compare to any other results. ...

There are some perl/python/ruby scripts around to do those types of transpositions; a minimal sketch of one is below. The frequencies of such alterations might well be listed in the textual-criticism literature, i.e. what errors are actually observed in real usage.
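For illustration only, here is a small Python sketch of the sort of error-injection script meant above (not any standard benchmark tool, and the operation names and function are my own invention): it corrupts a word with one random insertion, deletion, substitution, or adjacent-character swap.

    # Hypothetical sketch of a synthetic-error generator, not a standard tool.
    import random
    import string

    def corrupt(word, rng=random):
        """Return a copy of `word` with one synthetic spelling error."""
        if len(word) < 2:
            return word
        op = rng.choice(["insert", "delete", "substitute", "swap"])
        i = rng.randrange(len(word))
        if op == "insert":
            return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
        if op == "delete":
            return word[:i] + word[i + 1:]
        if op == "substitute":
            return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
        # swap two adjacent characters
        i = min(i, len(word) - 2)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]

    # Example: corrupt("evaluation") might yield "evlauation" or "evaluaton".

As the original poster notes, though, results on such synthetic data are hard to compare with anyone else's unless the corruption procedure and its error-type frequencies are agreed upon.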

Regards, Trevor

<>< Re: deemed!


