[Corpora-List] A Problem About Chinese Language Modeling

张坤鹏 smallfish at mail.nankai.edu.cn
Tue Feb 10 11:58:51 CET 2009


Hello everyone,

I want to build a Chinese language model from a corpus of about 1.1G. My question is whether it is better to model at the character level or at the word level (or at an even higher level, such as phrases). Since the vocabulary of Chinese words is much larger than that of Chinese characters, a character-based model can use a higher order than a word-based model. I ran an experiment on a smaller corpus, and the perplexity (ppl) of the word-based model was much higher than that of the character-based model, at least partly because the word-based model has more OOVs than the character-based one. But if finer granularity is preferred, why don't we model English at the character level rather than the word level? I would be grateful for any suggestions on this problem.
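A side note on the comparison itself: perplexity is normalized by the number of predicted units, so a per-word figure and a per-character figure are not on the same scale. A minimal sketch of one common way to put them on the same scale, assuming both models score the same test text and that the total log probability of that text is available from each model. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from any experiment. It also assumes the word model assigns probability to the full text; if OOV words are mapped to an unknown token or skipped, the comparison needs further care.

```python
def perplexity(total_log2_prob: float, n_units: int) -> float:
    """Perplexity from the total log2 probability of a text,
    normalized by the number of units (words or characters)."""
    return 2.0 ** (-total_log2_prob / n_units)

# Hypothetical test text: 1000 characters segmented into 600 words.
n_chars = 1000
n_words = 600

# Hypothetical total log2 probabilities assigned to that same text.
word_model_log2_prob = -6000.0
char_model_log2_prob = -6500.0

# Per-word perplexity of the word model (the figure a toolkit usually reports).
word_ppl_per_word = perplexity(word_model_log2_prob, n_words)

# The same word model, renormalized per character.
word_ppl_per_char = perplexity(word_model_log2_prob, n_chars)

# Per-character perplexity of the character model.
char_ppl_per_char = perplexity(char_model_log2_prob, n_chars)

print(word_ppl_per_word, word_ppl_per_char, char_ppl_per_char)
```

With these made-up numbers the word model looks far worse per word, yet slightly better than the character model once both are normalized per character, which is why the raw ppl values alone do not settle the question.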


