> I am working on a trilingual comparable corpus of French/English and
> What surprises me is that the number of unique nouns in Japanese is
> three times the number of unique nouns in English, even though the
> difference of total number of words in both language is not that large
> (the ratio for French/English is more consistant for example).
Another possible reason for the difference could be the way "nouns" are categorized in the three languages. If you used MeCab with the Ipadic dictionary, as recommended on MeCab's download site, then the POS category "名詞:一般" (noun:general) includes not only proper and common nouns, which correspond to nouns in English or French, but also categories such as:
1) "名詞-サ変接続" (noun:verbal), nouns which can be used with the light verb "suru" to form a verb;
2) "名詞-形容動詞語幹" (noun:adjective-na), nouns which can also be used as adjectives by adding the suffix -na;
3) "名詞-代名詞" (noun:pronoun);
4) "名詞-副詞可能" (noun:adverbial), nouns which can also be used as adverbials;
5) "名詞-非自立" (noun:bound), nouns which can only be used in noun compounds, which are then split (or oversplit, as Jim Breen and others suggested) into more units than the corresponding English or French nouns would be.
I do not have exact numbers at hand, but the category noun:verbal is quite large and I think it could have influenced the difference in the total number of nouns, although that still does not account for the type/token ratio difference.
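To check how much each subcategory contributes, one could count distinct noun types per Ipadic subcategory directly from MeCab's output. The following is only a minimal sketch: it assumes MeCab's default output format with Ipadic (surface form, a tab, then comma-separated features with the POS first, the subcategory second and the base form seventh), and the function name and the hand-written sample lines are hypothetical, not real corpus data.

```python
from collections import defaultdict

# Hand-written sample in MeCab's default Ipadic output format
# (surface<TAB>comma-separated features). Illustrative only, not corpus data.
SAMPLE = "\n".join([
    "私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ",
    "今日\t名詞,副詞可能,*,*,*,*,今日,キョウ,キョー",
    "静か\t名詞,形容動詞語幹,*,*,*,*,静か,シズカ,シズカ",
    "に\t助詞,副詞化,*,*,*,*,に,ニ,ニ",
    "本\t名詞,一般,*,*,*,*,本,ホン,ホン",
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ",
    "勉強\t名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー",
    "する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル",
    "こと\t名詞,非自立,一般,*,*,*,こと,コト,コト",
    "勉強\t名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー",
    "EOS",
])


def noun_types_by_subcategory(mecab_output):
    """Map each noun subcategory to the set of distinct base forms tagged with it."""
    types = defaultdict(set)
    for line in mecab_output.splitlines():
        # Skip sentence delimiters and anything that is not a token line.
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        if fields[0] != "名詞" or len(fields) < 2:
            continue
        # Use the base form when present; fall back to the surface form.
        lemma = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        types[fields[1]].add(lemma)
    return dict(types)


if __name__ == "__main__":
    for subcategory, lemmas in noun_types_by_subcategory(SAMPLE).items():
        print(subcategory, len(lemmas))
```

Run over the real MeCab output, the per-subcategory type counts would show whether the noun figure is still three times the English one once categories such as noun:verbal are set aside.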
Kristina Hmeljak Sangawa
kristina.hmeljak at guest.arnes.si
Dept. of Asian and African Studies, Faculty of Arts, University of Ljubljana