[Corpora-List] Number of unique words in text for different languages

John F. Sowa
Thu Aug 12 15:47:46 CEST 2010


On 8/12/2010 9:17 AM, Jim Breen wrote:
> Japanese morphological analysers such as MeCab, Chasen, etc. tend to
> over-split so that what might be considered a single word in English or
> French may end up as two or three elements in MeCab's output.

Over-splitting would increase the total word count, but reduce the count of unique words. The huge number of unique words that Emmanuel Prochasson found was probably the result of grouping long Kanji strings into a single so-called noun.

For example, English 'life insurance company employee' would count as 4 words, but the German 'Lebensversicherungsgesellschaftsangestellter' would be counted as just one word.
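To make the effect concrete, here is a minimal sketch in Python. It assumes a made-up toy corpus of five compounds built from four parts; the data and the helper name count_tokens_and_types are hypothetical and have nothing to do with MeCab's actual output. It counts total words (tokens) and unique words (types) twice: once with each compound kept as a single string, and once with each compound split into its parts.

```python
from collections import Counter

# Toy corpus (invented for illustration): each compound is a tuple of parts.
compounds = [
    ("life", "insurance", "company", "employee"),
    ("life", "insurance", "company"),
    ("insurance", "company", "employee"),
    ("life", "insurance"),
    ("company", "employee"),
]


def count_tokens_and_types(tokens):
    """Return (total token count, unique type count) for a list of tokens."""
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


# Segmentation A: every compound is kept as one unsplit word,
# as with the German example or a long Kanji string treated as one noun.
unsplit = ["".join(parts) for parts in compounds]

# Segmentation B: every compound is split into its parts,
# as an over-splitting analyser would do.
split = [part for parts in compounds for part in parts]

unsplit_tokens, unsplit_types = count_tokens_and_types(unsplit)
split_tokens, split_types = count_tokens_and_types(split)

print("unsplit:", unsplit_tokens, "tokens,", unsplit_types, "types")
print("split:  ", split_tokens, "tokens,", split_types, "types")
```

On this toy data the unsplit segmentation gives 5 tokens and 5 types, while the split segmentation gives 14 tokens and only 4 types: more words in total, fewer unique ones.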

John Sowa


