Dear Mark, John,

Let me confess to a moment of embarrassment that I've been anxious about for years: following SENSEVAL-1 I did a (tiny) experiment to establish inter-annotator agreement, and came up with the 95% figure cited by John.

With the experience I have gained since, I think the findings were not sound: it is most unusual to get a figure that high. I regret having published it (and, worse, having put it in the title of a short paper from EACL-99).

Whether for automatic WSD or even for the gold standard, I agree entirely with John:

> Miss Elliott, my high-school English teacher, wouldn't give
> anyone a gold star [for work like that]


>> Off the top of my head, here's two relevant studies on inter-rater
>> reliability for WSD, one for the case of expert annotators and one for
>> the case of non-experts:
>> http://link.springer.com/article/10.1023/A:1002693207386#page-1
> From the abstract at the pointy end of this pointer:
>> The exercise identifies the state-of-the-art for fine-grained word sense
>> disambiguation, where training data is available, as 74–78% correct, with
>> a number of algorithms approaching this level of performance. For systems
>> that did not assume the availability of training data, performance was
>> markedly lower and also more variable. Human inter-tagger agreement was
>> high, with the gold standard taggings being around 95% replicable.
> Implication: For a 300-word page of text, a state-of-the-art program
> would have about 75 errors. That would be an average of two errors
> for 8-word sentences, or five errors for 20-word sentences.
> For the "gold" standard, there would still be 15 errors in a 300-word
> page. Miss Elliott, my high-school English teacher, wouldn't give
> anyone a gold star for 15 errors per page.
> John
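
For anyone who wants to check John's arithmetic, here is a minimal sketch. It assumes that every word on the page is sense-tagged and that errors are spread evenly, neither of which the abstract states. The function name and the 75% figure (taken from within the 74–78% range) are illustrative choices only.

```python
def expected_errors(n_words: int, accuracy_pct: int) -> float:
    """Expected number of mis-tagged words, assuming every word is
    tagged and each is correct with probability accuracy_pct / 100."""
    return n_words * (100 - accuracy_pct) / 100


# State-of-the-art system at roughly 75% correct (assumed round figure).
print(expected_errors(300, 75))  # 300-word page
print(expected_errors(8, 75))    # 8-word sentence
print(expected_errors(20, 75))   # 20-word sentence

# "Gold standard" at about 95% replicable.
print(expected_errors(300, 95))  # 300-word page
```

In practice only a fraction of the words in running text are ambiguous content words, so the per-page counts would likely be lower than these; the sketch only reproduces the figures as John computed them.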
