romaji word list

Jim Breen Jim.Breen at infotech.monash.edu.au
Mon Dec 11 08:22:00 CET 2006



>> I've been searching around the web for a word frequency list for Japanese,
>> in romaji (Latin letters). I haven't had any luck but I did find a nifty
>> web based converter from kanji, hiragana, katakana to romaji, and thought I
>> would pass on the address.
>>
>> http://kanjidict.stc.cx/kakasifilt.php


There are a few frequency-ranked word lists around. The Japanese
"National Language Research Institute" (NLRI), which is an agency of
the Education Ministry, has published them from time to time.

A free public domain ranked list is available from my FTP site
(go to http://ftp.cc.monash.edu.au/pub/nihongo/00INDEX.html and search for
"wordfreq"). It is based on 4 years of the Mainichi Shimbun from the
mid-1990s. Of course it is not in romaji, but if you must work in romaji
(I can't imagine why) you could probably convert the file to
romaji using utility software such as Kakasi (on which the site you
mention is based). There are Windows and Linux versions of Kakasi.
As you are possibly aware, Japanese has oodles of homophones, so
converting the words into romaji will obliterate the differences between
them.
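
To make the homophone point concrete, here is a minimal Python sketch of
what happens to the counts once words are keyed by romaji. Everything in
it is an assumption for illustration: the counts are invented (they are
not taken from the "wordfreq" file), and the small hand-written table
stands in for a real converter such as Kakasi.

```python
from collections import Counter

# Toy data only: these counts are invented for illustration and are not
# taken from the "wordfreq" file.
toy_counts = [("橋", 120), ("箸", 45), ("端", 30), ("水", 200)]

# Hand-written stand-in for a real converter such as Kakasi
# (hypothetical table, one reading per word).
toy_romaji = {"橋": "hashi", "箸": "hashi", "端": "hashi", "水": "mizu"}


def merge_by_romaji(counts, romaji_table):
    """Sum the counts of all words that share the same romaji spelling."""
    merged = Counter()
    for word, count in counts:
        merged[romaji_table[word]] += count
    return merged


romaji_counts = merge_by_romaji(toy_counts, toy_romaji)
print(romaji_counts.most_common())
```

In this toy case "bridge", "chopsticks" and "edge" all end up as a single
"hashi" entry, so the romaji list can no longer tell them apart.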


>> It's a bit time consuming and balanced this corpus might not be. But if you
>> are looking for stop words, like I am, this method seems to work ok - better
>> than no frequency list at all. And if there is a romaji frequency list out
>> there, could you let me know :)


I'm curious to know what you think would be "stop words" in Japanese.
The particles (joshi) are likely candidates, and if you look at the
"wordfreq" file they are at the head of the list (no surprises there). Of
course, it all depends on the definition of "word", which is a rather
fuzzy concept in Japanese.
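
One rough way to check where the particles sit in a ranked list is
sketched below in Python. The ranked list is a toy with an invented order
(not copied from the "wordfreq" file), the particle set is only a handful
of common joshi, and the function name is hypothetical.

```python
# Toy ranked list, most frequent first. The order is invented for
# illustration and is not copied from the "wordfreq" file.
toy_ranked = ["の", "に", "は", "を", "日本", "が", "年", "と"]

# A few common particles (joshi); a real list would be longer.
particles = {"の", "に", "は", "を", "が", "と"}


def particle_ranks(ranked, particle_set):
    """Map each particle found in the ranked list to its 1-based rank."""
    return {word: rank
            for rank, word in enumerate(ranked, start=1)
            if word in particle_set}


ranks = particle_ranks(toy_ranked, particles)
print(ranks)
```

Whether the entries found this way count as "stop words" still depends on
how the list was segmented into "words" in the first place.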

Cheers

Jim

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学



