[Corpora-List] Ambiguous words in English and their frequency

Sebastian Hellmann hellmann at informatik.uni-leipzig.de
Thu Jan 26 09:58:43 CET 2012


Hi Karen, I don't have an answer to your question, but I was intrigued by how one would calculate evidence for a claim such as: "in $language, p1% of the words represent p2% of the ambiguity".

Here is my try:

You would take a dictionary and then count the number of defined meanings per entry. Let's say that "ambiguity" only occurs in context, and that words (or tokens) with several meanings in a dictionary are called "polysemous". So all polysemous tokens would have more than one meaning in the dictionary.
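As a rough illustration of this counting step, here is a minimal Python sketch. The dictionary is made-up toy data (not taken from Wiktionary), and the name polysemous_entries is only a hypothetical helper for this example:

```python
# Toy dictionary: headword -> list of sense glosses.
# ASSUMPTION: invented data for illustration, not a real Wiktionary extract.
toy_dictionary = {
    "the": ["definite article", "adverb used with comparatives"],
    "bank": ["financial institution", "edge of a river", "to tilt an aircraft"],
    "corpus": ["collection of texts"],
}


def polysemous_entries(dictionary):
    """Return the set of headwords that list more than one sense."""
    return {word for word, senses in dictionary.items() if len(senses) > 1}


polysemous = polysemous_entries(toy_dictionary)
```

With a real dictionary the result would of course depend heavily on how finely the dictionary splits senses.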

Then you take all polysemous words and create sensible surface forms (such as adding a plural 's'). Then you would need another corpus that provides word/token frequencies in real-life texts.

Then you can calculate what share polysemous tokens take in overall word usage, right? In English that would be quite a lot: the entry for 'the' (http://en.wiktionary.org/wiki/the) lists several meanings, and 'the' makes up around 7% of all words in the Brown Corpus. There you would have my first hypothesis: "in English the word 'the' represents 7% of the ambiguity"
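The share calculation sketched above might look roughly like this in Python. Everything here is an assumption for illustration: the token list is a toy sentence rather than a real corpus, naive_surface_forms is a hypothetical and deliberately crude stand-in for proper morphology, and the set of polysemous words is passed in by hand:

```python
from collections import Counter


def naive_surface_forms(word):
    """Hypothetical, very naive form generation: the word itself plus a plural 's'."""
    return {word, word + "s"}


def polysemous_token_share(tokens, polysemous_words):
    """Fraction of tokens whose surface form belongs to a polysemous headword."""
    forms = set()
    for word in polysemous_words:
        forms |= naive_surface_forms(word)
    # Lowercase the tokens so that 'The' and 'the' are counted together.
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(count for form, count in counts.items() if form in forms)
    return hits / total


# ASSUMPTION: a toy sentence standing in for a real frequency corpus.
toy_tokens = "The bank by the river lends to other banks".split()
share = polysemous_token_share(toy_tokens, {"the", "bank"})
```

In this toy sentence 4 of the 9 tokens ('The', 'bank', 'the', 'banks') match a polysemous form. A real run would replace the toy tokens with counts from a corpus and the hand-written set with the polysemous entries of a dictionary.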

Overall, it is a really nice question, as it can only be answered by corpus analysis. A human rater would probably not consider 'the' ambiguous without a certain sensitivity to linguistics.

I am currently trying to integrate Wortschatz and Wiktionary via RDF and will try to actually calculate what I sketched above. It is a very interesting question, and the result could also be used to measure the coverage and completeness of dictionaries.

All the best, Sebastian

On 01/25/2012 08:33 PM, FORT, Karen wrote:
> Hi all,
>
> I need to find this information (the proportion of ambiguous words in English and their frequency).
> For example, we know that in French 8% of the words represent 30% of the ambiguity.
> Of course, it's very rough, but it's only to have a rough idea.
>
> Can somebody help me with this (of course, I searched for a ref but could not find anything precise)?
>
> Thank you in advance,
>
> Regards,
>
>
> Karën FORT
> Ingénieure/Engineer et/and doctorante/PhD student
> INIST-CNRS / LIPN
> 2, allée de Brabois
> 54500 Vandoeuvre-lès-Nancy
> France
> Bureau/Office: H112
> +33 (0)3 83 50 46 36
>
> http://www-lipn.univ-paris13.fr/~fort/

-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org


