Word frequencies in English, French, German, ......

Jim Breen Jim.Breen at infotech.monash.edu.au
Fri Feb 16 04:08:00 CET 2007


Uwe Quasthoff <quasthoff_AT_informatik.uni-leipzig.de> wrote:

>> please have a look at http://corpora.informatik.uni-leipzig.de/download.html
>> You will find frequency lists as plain text (words.txt) and MySQL data
>> files (words) (sorry, not for Portuguese at the moment) calculated from
>> corpora of 100.000 to 3.000.000 sentences, depending on the language.
>> In addition, you can get the corpora and pre-calculated co-occurrences.


I finally got around to glancing over the Japanese section of that project.
The frequency list seems fine for nouns, but is a bit of a mess for
inflected words such as verbs and adjectives. Many partially inflected
forms appear independently as "words", and many fragments of inflectional
endings are also listed. For example, "tabereba" (if [I] eat) has been
treated as the "word" _tabere_ plus the particle _ba_, rather than being
counted under the root form (taberu). Few would class _tabere_ as a word
in its own right. As a result, the frequencies for verbs, etc. are
skewed; one would need to aggregate the counts for the fragments to get
an accurate picture of the frequency.

I suspect the problem lies with the segmenter used for the Japanese text
(did you use MeCab?). Others, such as ChaSen and JUMAN, lemmatize correctly
and would have given taberu as the root of tabereba.
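
To illustrate the difference, here is a rough Python sketch of counting by
root form rather than by surface form. It assumes the analyzer emits a
(surface, lemma) pair for each token; the token list and the helper
count_by are made up for illustration and are not taken from the Leipzig
data or from any particular segmenter's output format.

```python
from collections import Counter

# Hypothetical analyzer output: one (surface form, lemma) pair per token.
# These pairs are invented for illustration only.
tokens = [
    ("tabereba", "taberu"),
    ("tabeta", "taberu"),
    ("taberu", "taberu"),
    ("hon", "hon"),
    ("hon", "hon"),
]


def count_by(pairs, index):
    """Count tokens keyed on one field of each pair (0 = surface, 1 = lemma)."""
    return Counter(pair[index] for pair in pairs)


# Counting surface forms spreads one verb over several entries.
surface_counts = count_by(tokens, 0)

# Counting lemmas aggregates every inflected form under its root.
lemma_counts = count_by(tokens, 1)

for lemma, freq in lemma_counts.most_common():
    print(lemma, freq)
```

With counts keyed on the lemma, the inflected forms of a verb all
contribute to a single entry instead of being scattered across fragments.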

Cheers

Jim

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学



