I'm wondering if there has been much research on inter-rater reliability of word sense disambiguation by raters on something like Mechanical Turk. For example:
-- Given verbs that have five word senses each in WordNet (e.g. tag, tame, taste, temper), how well do native speakers agree on the word sense of these verbs in context?

-- How does this inter-rater reliability change for verbs with just two senses (e.g. taint, tamper, tan, tank) or perhaps ten senses (e.g. shift, spread, stop, trim)? (In other words, intuition suggests that words with two WordNet senses would show higher inter-rater reliability than words with five senses, and that for words with ten WN senses, inter-rater reliability would be pretty bad.)

-- Semantically, which kinds of 2- / 5- / 10-sense WN entries have the best inter-rater reliability, and which have the worst?
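(For concreteness: agreement across multiple annotators is typically summarized with a chance-corrected statistic such as Fleiss' kappa, which could be compared across the 2- / 5- / 10-sense conditions. Below is a minimal sketch; the toy counts are invented for illustration, and `fleiss_kappa` is just an assumed helper name, not part of any particular toolkit.)

```python
from typing import List

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for N items each labeled by n raters with k categories.

    counts[i][j] = number of raters who assigned sense j to item i.
    Assumes the same number of raters per item.
    """
    N = len(counts)            # number of annotated items (sentences)
    n = sum(counts[0])         # raters per item
    k = len(counts[0])         # number of senses
    # Mean observed per-item agreement
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Expected chance agreement from the marginal sense frequencies
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 3 sentences, 4 annotators, a verb with 2 WN senses,
# and perfect agreement on every sentence:
#   fleiss_kappa([[4, 0], [0, 4], [4, 0]]) == 1.0
```

One could then run separate annotation batches for the two-, five-, and ten-sense verbs and compare the resulting kappa values directly.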
Thanks in advance.
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================