[Corpora-List] Ambiguous words in English and their frequency

Kevin Brubeck Unhammer unhammer at gmail.com
Thu Jan 26 10:53:11 CET 2012


Sebastian Hellmann <hellmann at informatik.uni-leipzig.de> writes:


> Hi Karen,
> I don't have an answer for your question, but I was intrigued how you
> would calculate proof for the claim:
> "in $language p1% of the words represent p2% of the ambiguity"
>
> Here is my try:
>
> You would take a dictionary and then count the number of defined
> meanings per entry.
> Let's define that "ambiguity" only occurs in context and words (or
> tokens) with several meanings in a dictionary are called "polysemous".
> So all polysemous tokens would have more than one meaning in the
> dictionary.
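
To make the "p1% of the words represent p2% of the ambiguity" claim concrete, here is a rough sketch of how it could be computed from per-entry sense counts. The function name, the toy counts, and the choice to measure "ambiguity" as the number of senses held by polysemous entries are all assumptions for illustration, not something taken from a real dictionary:

```python
def ambiguity_share(sense_counts, top_fraction):
    """Fraction of all polysemous senses held by the top `top_fraction` of entries.

    Assumption: "ambiguity" is measured as the number of senses of entries
    that have more than one sense; other measures are equally defensible.
    """
    # Sort entries from most senses to fewest.
    counts = sorted(sense_counts.values(), reverse=True)
    # Only polysemous entries (more than one sense) contribute.
    total = sum(c for c in counts if c > 1)
    if total == 0:
        return 0.0
    # Number of entries that make up the requested top fraction.
    k = round(len(counts) * top_fraction)
    top = sum(c for c in counts[:k] if c > 1)
    return top / total


# Made-up sense counts, purely for illustration.
toy = {"run": 6, "set": 5, "bank": 4, "paper": 3, "go": 2,
       "the": 1, "of": 1, "cat": 1, "tree": 1, "sky": 1}

p1 = 0.2
p2 = ambiguity_share(toy, p1)
print(f"{p1:.0%} of the entries hold {p2:.0%} of the polysemous senses")
```

Whether monosemous entries should count in the denominator for p1 is a design choice; here they do, since the claim is about "the words" of the language as a whole.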

Ambiguity could also mean plain morphological ambiguity, e.g. "a bank", a noun, vs "to bank", a verb used by airline pilots, and on a more fine-grained level: "to bank", infinitive, vs "we bank", present tense indicative non-3SG.

Morphological ambiguity is easier to count than word-sense ambiguity, since (1) corpora and taggers often don't go any further than that, and (2) with word-sense ambiguity it's very hard to know how far to go ("river bank" vs "financial institution" is uncontroversial, but do you divide "building of financial institution" from "legal entity of financial institution"?). With morphological ambiguity, on the other hand, it is in most cases easy to test how ambiguous a form is[1]. With word senses you need some framework (or dictionary/Wiktionary) to constrain you.

[1] At least if you stick to observed sentences and don't go "but I could easily verb that noun" all the time.
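
As a minimal sketch of counting morphological ambiguity from observed data, one could collect the set of tags each surface form receives in a tagged corpus. The function name, the toy corpus, and the tag labels below are invented for illustration and do not follow any particular tagset:

```python
from collections import defaultdict


def tags_per_form(tagged_tokens):
    """Map each lowercased surface form to the set of tags observed for it."""
    seen = defaultdict(set)
    for form, tag in tagged_tokens:
        seen[form.lower()].add(tag)
    return dict(seen)


# Hand-made toy sample; a real count would read (form, tag) pairs
# from a tagged corpus instead.
toy_corpus = [
    ("a", "DET"), ("bank", "NOUN"),
    ("to", "PART"), ("bank", "VERB-INF"),
    ("we", "PRON"), ("bank", "VERB-PRES"),
    ("banks", "NOUN"),
]

forms = tags_per_form(toy_corpus)
# A form is counted as ambiguous if it was seen with more than one tag.
ambiguous = {form: tags for form, tags in forms.items() if len(tags) > 1}
for form, tags in ambiguous.items():
    print(form, len(tags), sorted(tags))
```

Because only observed (form, tag) pairs are counted, this sticks to attested readings, and how fine-grained the ambiguity comes out depends entirely on the tagset.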


> Then you take all polysemous words and create sensible surface forms
> (such as add plural 's' ).

Collecting all forms complicates things a bit; a word might be polysemous in singular, but monosemous (is that a word?) in plural. It happens with mass nouns, e.g. "paper" vs "papers", where the plural can't mean pieces of paper. (And then "bank" is no longer ambiguous between a noun and a verb.)

regards, Kevin Brubeck Unhammer


