Searching Japanese corpora

Jim Breen Jim.Breen at
Sat Dec 23 00:50:01 CET 2006

I stopped reading the grep thread, so I missed the start of this one...

>> From: Cyrus Shaoul <>

>Eric J. M. Smith wrote:
> Following up on our recent thread about grep with Unicode, I'm curious
> about how people search for text in Japanese-language corpora.

> My understanding of Japanese is rudimentary, but is it not possible
> (potentially at least) for the same text to be written in hiragana,
> katakana, or kanji? In order to find all occurrences of a particular
> string in a corpus, would I have to do the search 3 times, once for
> each script? I assume that would be the case for something like grep.
> But are there more sophisticated query tools which abstract away the
> question of which script is actually used for data within the corpus?

>> It is my understanding that it is possible to write the pronunciation of
>> all kanji and kanji compounds in both hiragana and katakana (and each
>> kanji/kanji compound can have multiple pronunciations).

Most kanji have two or more pronunciations; however, words written
in kanji almost always have just one pronunciation.

>> In most types of written Japanese, it would be uncommon to write the
>> pronunciation for kanji, and there are many words that are always
>> written in katakana or hiragana, and never in kanji, so when searching
>> for words, having a tool that would automatically search for a kanji
>> word and its kana representations at the same time would not be that
>> useful.

Another complication is that part of a word can often optionally be
written in kana (known as "okurigana"). Also, in compound verbs, the
second part of the verb is often written only in kana. From my
observation the two forms appear in approximately equal frequencies, but
it depends entirely on the source of the texts.
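
For a grep-style search, the optional kana described above can be treated as
optional parts of a regular expression. What follows is only a minimal sketch
in Python: the helper name okurigana_pattern and the example word
(取り扱い / 取扱い / 取扱, "handling") are illustrative choices, not something
taken from this thread.

```python
import re

def okurigana_pattern(segments):
    """Build a regex from (text, is_optional) pairs.

    Segments marked optional (typically okurigana) may be present or absent.
    Hypothetical helper, for illustration only.
    """
    parts = []
    for text, is_optional in segments:
        escaped = re.escape(text)
        parts.append(f"(?:{escaped})?" if is_optional else escaped)
    return re.compile("".join(parts))

# One pattern covering the spellings 取り扱い, 取扱い and 取扱.
toriatsukai = okurigana_pattern(
    [("取", False), ("り", True), ("扱", False), ("い", True)]
)

sample = "商品の取り扱いについて。取扱い注意。取扱説明書。"
matches = toriatsukai.findall(sample)
```

Note that such a pattern accepts every combination of the optional parts
(here also 取り扱), so it may over-match slightly; whether that matters
depends on the corpus.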

>> I should confess that there are some words that are written in both
>> kanji and kana with higher frequency, such as some older loanwords,
>> some place names, some proper names, some low-frequency kanji, and a
>> few other types of words.
>> I have a gut feeling that the number of words that fall into these
>> categories is not that large.

>> I don't know of any tools out there to do the kind of query you
>> mentioned, but it has been a few years since I worked on Japanese
>> text. In the meantime, I can only suggest making many queries, one
>> with the kanji/kanji compound and others with the hiragana and
>> katakana spellings of all the possible pronunciations.

That may not be very productive, but for words with common written
variants (and here I mean about 15% of the Japanese lexicon) you may
have to try all alternatives. In some cases search engines try to
canonicalize both the indexing and searching for variant spellings of
loanwords. For example the loanword for diamond can be either
"daiamondo" or "daiyamondo", and some search engines treat them as the
same. The number of words treated in that way seems to be rising. At
ACL/Coling this year I raised the topic of canonicalizing such things as
okurigana variants with a staff member of one of the search-engine
companies. He said there was some resistance to this as some people
wanted to search on a particular form.
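
The kind of canonicalization discussed above can be sketched roughly as
follows. This is not how any particular search engine does it; it only
assumes two things: that hiragana and katakana are folded together using the
fixed 0x60 offset between their Unicode blocks, and that loanword variants
come from a hand-made table. The table LOANWORD_VARIANTS and the function
names are hypothetical, and ダイアモンド / ダイヤモンド are the kana
spellings of the "daiamondo" / "daiyamondo" example.

```python
# Katakana U+30A1..U+30F6 sit exactly 0x60 above hiragana U+3041..U+3096.
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6
KANA_OFFSET = 0x60

def katakana_to_hiragana(text):
    """Fold katakana to hiragana; all other characters pass through unchanged."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )

# Hypothetical variant table: maps a variant spelling to the chosen canonical one.
LOANWORD_VARIANTS = {"ダイアモンド": "ダイヤモンド"}

def canonical_key(token):
    """Return the key under which a token would be indexed and searched."""
    token = LOANWORD_VARIANTS.get(token, token)
    return katakana_to_hiragana(token)

same_loanword = canonical_key("ダイアモンド") == canonical_key("ダイヤモンド")
same_script = canonical_key("ネコ") == canonical_key("ねこ")
```

Given the resistance mentioned above, one option is to keep the original
token alongside the canonical key, so that people who want to search on a
particular form still can.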

>> From: Brett Powley <>

>> I'm sure there's a qualified Japanese speaker out there who can tell
>> us this with authority (I'm not that person), but my understanding is
>> that there is a canonical form for words.

Well, the Education Ministry would like to think there is. From what one
reads and observes, much of the populace ignore them.

>> Katakana is used exclusively for foreign words

No way. "Predominantly" maybe, but you only have to glance at a manga,
or read a paper on a topic such as botany or entomology to see a lot
of non-loanwords in katakana.

>> Kanji (+ Hiragana modifiers) is used for Japanese words.
>> Words are only spelt out in Hiragana in beginners' and learners'
>> texts, normally in small type above the canonical Kanji form.

The reading (i.e. pronunciation) of unusual words and neologisms is
sometimes written in parentheses after the word. For example in the
recent controversy about the enshrinement of the spirits of war criminals
in the Yasukuni Shrine, the word "bunshi" was almost always spelled out
this way. In the Japanese Wikipedia many article names are spelled out
as well. Searching for the combination of a kanji word and its possible
reading is often a good way of confirming the reading, which is very
useful when collecting neologisms.
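
A search for "word followed by its reading in parentheses" might look
something like the sketch below. It assumes the reading is given in hiragana
and that either full-width or ASCII parentheses may be used. The function
name is hypothetical, and 分祀 (ぶんし) is my assumption about the kanji
form of "bunshi" meant above.

```python
import re

def reading_gloss_pattern(word, reading):
    """Match `word` immediately followed by `reading` in parentheses.

    Accepts full-width （ ） as well as ASCII ( ), with optional spaces.
    Hypothetical helper, for illustration only.
    """
    return re.compile(
        re.escape(word)
        + r"\s*[（(]\s*"
        + re.escape(reading)
        + r"\s*[）)]"
    )

pattern = reading_gloss_pattern("分祀", "ぶんし")

glossed = pattern.search("A級戦犯の分祀（ぶんし）をめぐる議論")
unglossed = pattern.search("分祀について")
```

Counting such matches for each candidate reading is one way to see which
reading a corpus actually supports.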


Jim Breen

Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
