> Over-splitting would increase the total word count, but reduce the
> count of unique words. The huge number of unique words that
> Emmanuel Prochasson found was probably the result of grouping
> long Kanji strings into a single so-called noun.
In fact MeCab splits long kanji strings into the component words. For example "kikanshizensokuchiryouyaku" (antiasthmatic drug) is typically split: kikanshi + zensoku + chiryou + yaku. (I say "typically", because you have to use a trained lexicon with MeCab, and there are several to choose from.)
Perhaps Emmanuel combined sequences of noun-tagged morphemes, but even then I don't think it could have got such a high unique word count.
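
To make the two counting schemes concrete, here is a rough sketch in Python. It does not call MeCab: the tagged morphemes are hand-written, hypothetical data in romanised form, and the tag names "noun" and "particle" are placeholders rather than the tags any particular lexicon emits. It only shows how splitting and merging noun runs move the total and unique word counts in opposite directions.

```python
from collections import Counter

# Hypothetical hand-tagged morphemes (romanised). Real output would come from
# MeCab with whichever lexicon is installed; these tags are placeholders.
SENTENCES = [
    [("kikanshi", "noun"), ("zensoku", "noun"), ("chiryou", "noun"), ("yaku", "noun")],
    [("zensoku", "noun"), ("no", "particle"), ("chiryou", "noun")],
    [("kikanshi", "noun"), ("no", "particle"), ("yaku", "noun")],
]


def count_split(sentences):
    """Count every morpheme as its own word."""
    return Counter(surface for sent in sentences for surface, _ in sent)


def count_merged(sentences):
    """Join each run of consecutive noun-tagged morphemes into one word."""
    counts = Counter()
    for sent in sentences:
        run = []
        for surface, tag in sent:
            if tag == "noun":
                run.append(surface)
                continue
            if run:  # a non-noun ends the current noun run
                counts["".join(run)] += 1
                run = []
            counts[surface] += 1
        if run:  # flush a noun run that reaches the end of the sentence
            counts["".join(run)] += 1
    return counts


split = count_split(SENTENCES)
merged = count_merged(SENTENCES)
print("split:  total", sum(split.values()), "unique", len(split))
print("merged: total", sum(merged.values()), "unique", len(merged))
```

On this toy data the merged scheme yields fewer total words but more unique ones, since the long compound becomes a type of its own. Whether that effect could account for a count as high as the one reported is a separate question that a toy example cannot answer.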
Jim
--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne