[Corpora-List] A Problem About Chinese Language Modeling

Yannick Versley yannick.versley at unitn.it
Tue Feb 10 13:20:36 CET 2009

Hello 张坤鹏,

Why don't you just build one of each and interpolate the two? I'm not familiar enough with Chinese to say whether it necessarily makes sense linguistically, but the literature in language modeling is full of results along the lines of "... and we interpolated these two models and got nice perplexity improvements", so it might be worth trying.

If you're aiming for something *simple*, then a character-based model would probably be better, since you wouldn't have to do word segmentation in the first place. (Note that character n-grams don't have much meaning in letter-based languages, and even in German, where you have lengthy synthetic compounds, splitting them is a nontrivial task, which is why people traditionally just treat those words as single units.)
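To make "interpolation" concrete, here is a minimal sketch of linear interpolation with the mixing weight tuned on held-out data. It assumes both models have already been made to assign probabilities to the same sequence of events (for example, per-character probabilities, which for a word-based model requires some mapping that is not shown here). The function names and the probability values are made up for illustration and do not come from any particular toolkit.

```python
import math

def interpolate(p_a, p_b, lam):
    """Linear interpolation of two probabilities for the same event."""
    return lam * p_a + (1.0 - lam) * p_b

def perplexity(probs):
    """Perplexity of a sequence, given the probability of each event."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

def best_lambda(probs_a, probs_b, steps=20):
    """Grid-search the mixing weight that minimises held-out perplexity."""
    best = None
    for i in range(steps + 1):
        lam = i / steps
        mixed = [interpolate(a, b, lam) for a, b in zip(probs_a, probs_b)]
        if any(p <= 0.0 for p in mixed):
            continue  # skip weights that would give an event zero probability
        ppl = perplexity(mixed)
        if best is None or ppl < best[1]:
            best = (lam, ppl)
    return best

# Hypothetical held-out probabilities from two models (illustrative only).
probs_a = [0.2, 0.01, 0.3, 0.05]
probs_b = [0.05, 0.2, 0.1, 0.1]

lam, ppl = best_lambda(probs_a, probs_b)
print("best lambda:", lam)
print("interpolated perplexity:", ppl)
print("model A alone:", perplexity(probs_a))
print("model B alone:", perplexity(probs_b))
```

A grid search is only the simplest option; in practice the weight is often estimated with EM on held-out data, but the idea is the same.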

> Hello everyone,
> I want to build a Chinese language model with a corpus of size 1.1G or so.
> Now I have a question: is it better to count on the character level or on the word level (or on an even higher level, like phrases)? Since the vocabulary size of Chinese words is much larger than that of characters, the order of a character-based model may be higher than that of a word-based model. I ran an experiment with a smaller corpus, whose result shows that the perplexity of the word-based model is much higher than that of the character-based model, (at least partially) because there are more OOVs in the first model than in the second. But if fine granularity is preferred, why don't we model English on the character level rather than the word level?
> I would be grateful if anyone could give me some suggestions on this problem.
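
On the perplexity comparison in the quoted message: a per-word perplexity and a per-character perplexity are normalised by different token counts, so they are not directly comparable. One way to put them on the same footing is to divide the total log-probability of the same test text by the number of characters in both cases. The sketch below uses made-up numbers and ignores OOV handling and sentence-boundary tokens, which would also need to be treated consistently.

```python
import math

def perplexity(total_log_prob, n_units):
    """Perplexity from a total natural-log probability and a unit count."""
    return math.exp(-total_log_prob / n_units)

# Hypothetical scores for the SAME test text (illustrative values only).
n_chars = 170            # characters in the test text
n_words = 100            # words after segmentation
word_model_logp = -600.0 # total ln P(text) under the word-based model
char_model_logp = -650.0 # total ln P(text) under the character-based model

word_ppl_per_word = perplexity(word_model_logp, n_words)
word_ppl_per_char = perplexity(word_model_logp, n_chars)
char_ppl_per_char = perplexity(char_model_logp, n_chars)

print("word model, per word:     ", word_ppl_per_word)
print("word model, per character:", word_ppl_per_char)
print("char model, per character:", char_ppl_per_char)
```

With these invented numbers, the word-based model looks much worse when each model is normalised by its own unit, yet it comes out ahead once both are normalised per character.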
