[Corpora-List] Number of unique words in text for different languages

Jim Breen jimbreen
Fri Aug 13 01:21:10 CEST 2010


John F. Sowa wrote:
> On 8/12/2010 9:17 AM, Jim Breen wrote:
>> Japanese morphological analysers such as MeCab, Chasen, etc. tend to
>> over-split so that what might be considered a single word in English or
>> French may end up as two or three elements in MeCab's output.


> Over-splitting would increase the total word count, but reduce the
> count of unique words. The huge number of unique words that
> Emmanuel Prochasson found was probably the result of grouping
> long Kanji strings into a single so-called noun.

In fact, MeCab splits long kanji strings into their component words. For example, "kikanshizensokuchiryouyaku" (antiasthmatic drug) is typically split into kikanshi (bronchial) + zensoku (asthma) + chiryou (treatment) + yaku (drug). (I say "typically" because MeCab has to be used with a trained lexicon, and there are several to choose from.)

Perhaps Emmanuel combined sequences of noun-tagged morphemes, but even then I don't think that would have produced such a high unique word count.
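
As a rough illustration of the counting argument, here is a minimal Python sketch. It does not call MeCab: the segmentations are written by hand, and apart from the first compound (the example above) they are hypothetical combinations of the same four components, chosen only to show the effect. It compares the total and unique counts when components are counted separately with the counts when each noun sequence is glued back into one long "word".

```python
# Hand-written segmentations (an assumption, NOT real MeCab output).
# Each inner list is one compound, already split into component words.
# Only the first compound comes from the discussion; the rest are
# hypothetical combinations used for illustration.
compounds = [
    ["kikanshi", "zensoku", "chiryou", "yaku"],
    ["kikanshi", "zensoku"],
    ["zensoku", "chiryou"],
    ["chiryou", "yaku"],
    ["kikanshi", "chiryou"],
    ["zensoku", "yaku"],
]


def counts(tokens):
    """Return (total tokens, unique types) for a list of tokens."""
    return len(tokens), len(set(tokens))


# Case 1: every component is counted as its own word (split output).
split_tokens = [word for compound in compounds for word in compound]

# Case 2: each noun sequence is merged back into a single long "word".
merged_tokens = ["".join(compound) for compound in compounds]

split_total, split_unique = counts(split_tokens)
merged_total, merged_unique = counts(merged_tokens)

print("split :", split_total, "tokens,", split_unique, "unique")
print("merged:", merged_total, "tokens,", merged_unique, "unique")
```

On this toy data, splitting gives more tokens but fewer unique words, while merging does the reverse. How large the effect is on a real corpus depends on the lexicon and on how the noun sequences are combined.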

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne


