"We collect 10 annotations for each of 177 examples of the noun “president” for the three senses given in SemEval. [...] performing simple majority voting (with random tie-breaking) over annotators results in a rapid accuracy plateau at a very high rate of 0.994 accuracy. In fact, further analysis reveals that there was only a single disagreement between the averaged non-expert vote and the gold standard; on inspection it was observed that the annotators voted strongly against the original gold la-bel (9-to-1 against), and that it was in fact found to be an error in the original gold standard annotation.6 After correcting this error, the non-expert accuracy rate is 100% on the 177 examples in this task. This is a specific example where non-expert annotations can be used to correct expert annotations. "
Xuchen Yao, Benjamin Van Durme and Chris Callison-Burch. Expectations of Word Sense in Parallel Corpora. NAACL Short. 2012. http://cs.jhu.edu/~vandurme/papers/YaoVanDurmeCallison-BurchNAACL12.pdf
"2 Turker Reliability
While Amazon’s Mechanical Turk (MTurk) has been considered in the past for constructing lexical semantic resources (e.g., (Snow et al., 2008; Akkaya et al., 2010; Parent and Eskenazi, 2010; Rumshisky, 2011)), word sense annotation is sensitive to subjectivity and usually achieves low agreement rate even among experts. Thus we first asked Turkers to re-annotate a sample of existing gold-standard data. With an eye towards costs saving, we also considered how many Turkers would be needed per item to produce results of sufficient quality.
Turkers were presented sentences from the test portion of the word sense induction task of SemEval-2007 (Agirre and Soroa, 2007), covering 2,559 instances of 35 nouns, expert-annotated with OntoNotes (Hovy et al., 2006) senses. [...]
We measure inter-coder agreement using Krippendorff’s Alpha (Krippendorff, 2004; Artstein and Poesio, 2008), [...]"
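For reference, Krippendorff's Alpha for nominal labels can be computed from a coincidence matrix as alpha = 1 - D_o / D_e, where D_o is the observed disagreement and D_e the disagreement expected by chance. The sketch below assumes nominal (unordered) sense labels and a hypothetical input layout in which each item is a list of the labels it received; the excerpt does not say which variant or implementation the authors used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    `units` is a list of items; each item is the list of labels assigned
    to it by different coders (hypothetical layout). Items with fewer
    than two labels carry no agreement information and are skipped.
    """
    # Coincidence matrix: each ordered pair of labels from two different
    # coders on the same item contributes 1 / (m - 1), where m is the
    # number of labels on that item.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the overall number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")

    # Observed and expected disagreement, both scaled by n so that the
    # common factor cancels in the ratio.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label was ever used, so Alpha is undefined.
        return float("nan")
    return 1.0 - observed / expected
```

Perfect agreement yields 1.0, chance-level agreement yields about 0, and systematic disagreement yields negative values.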