I've got some code that implements the Mihalcea-style all-words graph WSD algo; I am willing to share it, although it is quite messy and was not designed for general public use. It *does* rank the senses from most to least likely. (FWIW, I suspect the freeling code could be easily modified to do the same). However ...
However, anyone who has tried to do this will have run into a certain problem; I am not quite sure of what the best/most correct solution is ... can anyone help with that?
The graph-based algos start by assigning equal probability to all possible senses, and adding weighted edges between (most) word-sense pairs. They then solve what is more-or-less the Markov chain for the graph.
The result is a redistribution of the probabilities for each possible word sense -- those with the highest probabilities are then spit out as the "answer". However, and this is the "problem": the probabilities redistribute over *all* words, and not just all senses for one word. So, for example, the highest-ranked sense for one word in a sentence may have p=0.01 while that for a different word may have p=0.0001. Since these are the senses with the highest probability *for that word*, they are the sense that are printed out (e.g. by freeling). But what is the correct way of ranking? What is the correct way for computing "the probability of being right"? The "confidence of being correct"?
If p_w,s is the output of the algo for word w, sense s. should I assume that the "probability Q_w,s of sense s for word w being right" is equal to:
Q_w,s = p_w,s / (sum_over_t p_w,t)
i.e. normalizing as if this was a conditional probability? Or maybe there is some other assignment that is more accurate, in that it maximizes some correctness measure? For example, say we define "correctness" for the entire sentence as
C = sum_over_w,s Theta_w,s Q_w,s
where Theta_w,s=1 for the true, correct sense, and Theta_w,s=0 for all other (incorrect) sense assignments.
Does Q_w,s maximize this correctness? Is there some other function that maximizes it? The above formula has a kind-of trap: one very strongly (but incorrectly) ranked word can outweigh three weakly (but correctly) ranked words...
In summary: the graph WSD algos provide good qualitative output, but I don't understand how to interpret the results in a more quantitative fashion.
--linas