Why don't you just build one of each and interpolate the two? I'm not familiar enough with Chinese to judge whether that makes sense linguistically, but the language modeling literature is full of results along the lines of "... and we interpolated these two models and got nice perplexity improvements", so it might be worth trying (a minimal sketch is below). If you're aiming for something *simple*, then a character-based model would probably be better, since you wouldn't have to do word segmentation in the first place. (Note that character n-grams don't carry much meaning in alphabetic languages, and even in German, where you have lengthy compounds, splitting them is a nontrivial task, which is why people traditionally just treat those words as single units.)
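For concreteness, here is a rough sketch of what I mean by interpolation. The word_lm / char_lm objects and their prob(context, token) method are hypothetical stand-ins for whatever toolkit you use; the weight LAMBDA would be tuned on held-out data. The one real subtlety is that you can only linearly interpolate probabilities over the *same* event space, so the character model's per-character probabilities are chained into a per-word probability first (with an end-of-word marker so the word distribution sums to one):

    import math

    LAMBDA = 0.7  # weight on the word model; tune on held-out data

    def word_prob_from_chars(char_lm, word, char_history):
        # Probability of `word` under the character model, chained
        # character by character; "</w>" is a hypothetical end-of-word
        # symbol the character model was trained with.
        p = 1.0
        context = list(char_history)
        for ch in word + "</w>":
            p *= char_lm.prob(tuple(context), ch)  # assumed API
            context.append(ch)
        return p

    def interp_prob(word_lm, char_lm, word, word_history, char_history):
        # Linear interpolation: both terms are probabilities of the
        # same event (the next word), so they can be mixed directly.
        p_word = word_lm.prob(tuple(word_history), word)  # assumed API
        p_char = word_prob_from_chars(char_lm, word, char_history)
        return LAMBDA * p_word + (1 - LAMBDA) * p_char

    def perplexity(word_lm, char_lm, sentences):
        # Per-word perplexity of the interpolated model; `sentences`
        # is an iterable of word-segmented sentences (lists of words).
        log_prob, n_tokens = 0.0, 0
        for sent in sentences:
            word_hist, char_hist = [], []
            for w in sent:
                p = interp_prob(word_lm, char_lm, w, word_hist, char_hist)
                log_prob += math.log(p)
                n_tokens += 1
                word_hist.append(w)
                char_hist.extend(w + "</w>")
        return math.exp(-log_prob / n_tokens)

Scoring per word like this also makes the two perplexities directly comparable, which they aren't if one model is normalized per character and the other per word.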
-Y.
> Hello everyone,
> I want to build a Chinese language model with a corpus of about 1.1G.
> Now I have a question: is it better to count at the character level or at
> the word level (or at an even higher level, like phrases)? Since the
> vocabulary size of Chinese words is much larger than that of characters,
> the order of a character-based model may be higher than that of a
> word-based model. I ran an experiment with a smaller corpus, whose result
> shows that the perplexity of the word-based model is much higher than that
> of the character-based model, (at least partially) because there are more
> OOVs in the former than in the latter. But if fine granularity is
> preferred, why don't we model English at the character level rather than
> the word level?
> I would be grateful for any suggestions on this problem.