[Corpora-List] Ambiguous words in English and their frequency

Karen Fort karen.fort at inist.fr
Thu Feb 2 11:25:18 CET 2012


Hi all,

I did not find the time to clarify my question earlier, and in the meantime I received a lot of very interesting answers and references. Thank you all for this!

In fact, I should have said that I'm looking for the number of ambiguous word tokens in terms of POS in an English corpus, for example from the Penn TreeBank. One solution would be to compute this myself from the Brown corpus, but I was curious if there was a ref. on this.
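For what it's worth, computing this myself could look roughly like the sketch below. It is only a minimal illustration under stated assumptions: a token counts as ambiguous if its word form occurs with more than one tag anywhere in the same corpus (so the lexicon is derived from the corpus itself, not from an external dictionary or analyser), case is folded, and the tagged sample is invented toy data with Penn-style tags. The function name `ambiguity_rate` is hypothetical.

```python
from collections import defaultdict


def ambiguity_rate(tagged_tokens, lowercase=True):
    """Return (share of ambiguous tokens, share of ambiguous word types).

    Assumption: a word type is 'ambiguous' if it is observed with more
    than one POS tag in the given corpus.
    """
    tags_by_word = defaultdict(set)
    tokens = []
    for word, tag in tagged_tokens:
        key = word.lower() if lowercase else word
        tags_by_word[key].add(tag)
        tokens.append(key)

    if not tokens:
        return 0.0, 0.0

    # Word types seen with at least two distinct tags.
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}

    token_share = sum(1 for w in tokens if w in ambiguous) / len(tokens)
    type_share = len(ambiguous) / len(tags_by_word)
    return token_share, type_share


# Invented toy sample, NOT real corpus data.
sample = [
    ("The", "DT"), ("can", "NN"), ("can", "MD"), ("hold", "VB"),
    ("water", "NN"), ("They", "PRP"), ("water", "VBP"),
    ("the", "DT"), ("plants", "NNS"),
]

token_share, type_share = ambiguity_rate(sample)
print(f"ambiguous tokens: {token_share:.1%}, ambiguous types: {type_share:.1%}")
```

The same function should accept any iterable of (word, tag) pairs, for example the tagged Brown corpus as exposed by a toolkit such as NLTK. The figures will of course vary with the tagset and with whether case is folded.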

I found this reference for French, which reports that 60% of the French tokens in their corpus were unambiguous in terms of POS:

Tzoukermann, E.; Radev, D. R. & Gale, W. A. (1999). Tagging French without lexical probabilities -- combining linguistic knowledge and statistical learning. In Church, K.; Armstrong, S.; Isabelle, P.; Tzoukermann, E. & Yarowsky, D. (eds.), Natural Language Processing Using Very Large Corpora. Kluwer Academic.

Of course, it all depends on the number of tags, their granularity, and so on. Such a figure only gives a very rough idea and should be taken in its context, obviously. But that's all I need.

Best,

Karen

On 26/01/2012 10:39, Eckhard Bick wrote:
> Hello again,
>
> I forgot to add that the ambiguous word tokens in my English test run
> amounted to 49.8%.
>
> Best,
> Eckhard
>
> On 2012-01-25 20:33, FORT, Karen wrote:
>> Hi all,
>>
>> I need to find this information (the proportion of ambiguous words in English and their frequency).
>> For example, we know that in French 8% of the words represent 30% of the ambiguity.
>> Of course, it's very rough, but it's only to have a rough idea.
>>
>> Can somebody help me with this (of course, I searched for a ref but could not find anything precise)?
>>
>> Thank you in advance,
>>
>> Regards,
>>
>>
>> Karën FORT
>> Ingénieure/Engineer et/and doctorante/PhD student
>> INIST-CNRS / LIPN
>> 2, allée de Brabois
>> 54500 Vandoeuvre-lès-Nancy
>> France
>> Bureau/Office: H112
>> +33 (0)3 83 50 46 36
>>
>> http://www-lipn.univ-paris13.fr/~fort/

--
Karën FORT
Ingénieure/Engineer et/and doctorante/PhD student
INIST-CNRS / LIPN
2, allée de Brabois
54500 Vandoeuvre-lès-Nancy
France
Bureau/Office: H112
+33 (0)3 83 50 46 36

http://www-lipn.univ-paris13.fr/~fort/


