I remember several discussions of inter-rater reliability
during my involvement in the first Senseval. I have found
some of the documentation at
No doubt Senseval-2 and Senseval-3 have also
discussed this topic?
Date: Mon, 15 Jul 2013 21:59:18 +0000
From: Mark Davies <Mark_Davies at byu.edu>
Subject: [Corpora-List] WSD / # WordNet senses / Mechanical Turk
To: "corpora at uib.no" <corpora at uib.no>
Sorry if this is a basic question for computational linguists; I'm a corpus linguist.
I'm wondering if there has been much research on inter-rater reliability of word sense disambiguation by raters on something like Mechanical Turk. For example:
-- Given some verbs that have 5 word senses each in WordNet (e.g. the verbs tag, tame, taste, temper), how well do native speakers agree on the word sense for these verbs in context?

-- How does this inter-rater reliability change for words that might have just two senses (e.g. the verbs taint, tamper, tan, tank) or maybe 10 senses (e.g. the verbs shift, spread, stop, trim)? (In other words, intuition suggests that words with two WordNet senses might show higher inter-rater reliability than words with five senses, and that for words with 10 WN senses, inter-rater reliability would be pretty bad.)

-- Semantically, which kinds of 2 / 5 / 10 WN entry words have the best inter-rater reliability, and which have the worst?
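For concreteness, agreement among several Mechanical Turk raters per item is commonly summarized with a chance-corrected statistic such as Fleiss' kappa. Below is a minimal Python sketch (the function name and input layout are my own, not from any particular toolkit): each item is a list of sense labels, one per rater, with the same number of raters per item.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category
    labels (one per rater); every item needs the same rater count."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]  # per-item label counts

    # mean observed agreement across items
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # expected chance agreement from marginal label proportions
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement on two items by three raters yields kappa = 1.0:
# fleiss_kappa([["sense1"] * 3, ["sense2"] * 3])
```

With real sense-annotation data, kappa computed this way could then be compared across the 2-, 5-, and 10-sense verb groups.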
Thanks in advance.
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================