[Corpora-List] A Problem About Chinese Language Modeling

Philipp Koehn pkoehn at inf.ed.ac.uk
Tue Feb 10 14:30:32 CET 2009


in machine translation, we see benefits from word segmentation. The trade-off is that an n-gram over words can include more context than an n-gram over characters. There may be a problem with the perplexity numbers you compute. If it is average perplexity per token (as it is usually measured), then the higher perplexity of the word model compared to the character model is misleading: a word spans several characters, so per-word perplexity is naturally higher than per-character perplexity even when the models assign the same probability to the corpus. If this is the case, converting the word-based perplexity to a per-character basis makes the comparison more informative...
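A minimal sketch of that conversion (the function name and the corpus counts below are hypothetical, for illustration only): since the total log-probability of the test corpus is the same whichever unit we normalize by, log PPL_char = (N_words / N_chars) * log PPL_word.

```python
import math

def word_ppl_to_char_ppl(word_ppl, n_words, n_chars):
    """Re-normalize a per-word perplexity to a per-character perplexity.

    The total log-probability of the test corpus is identical under both
    views; only the normalizer (number of words vs. number of characters)
    changes:  log PPL_char = (n_words / n_chars) * log PPL_word
    """
    return math.exp((n_words / n_chars) * math.log(word_ppl))

# Made-up numbers: a word model with perplexity 500 on a test set of
# 1,000,000 words covering 1,600,000 characters (avg. 1.6 chars/word).
print(word_ppl_to_char_ppl(500.0, 1_000_000, 1_600_000))
```

Only after this re-normalization is it fair to compare the word-based number against the character model's perplexity.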


On Tue, Feb 10, 2009 at 10:58 AM, 张坤鹏 <smallfish at mail.nankai.edu.cn> wrote:
> Hello everyone,
> I want to build a Chinese language model with a corpus of about 1.1 GB.
> Now I have a question: is it better to count on the character level or on
> the word level (or on an even higher level, like phrases)? Since the
> vocabulary of Chinese words is much larger than that of characters, the
> order of a character-based model can be higher than that of a word-based
> model. I ran an experiment with a smaller corpus, and the result shows
> that the perplexity of the word-based model is much larger than that of
> the character-based model, (at least partially) because there are more
> OOVs in the first model than in the second. But if fine granularity is
> preferred, why don't we model English at the character level rather than
> the word level?
> I would be grateful if anyone could give me suggestions on this problem.
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
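The OOV effect the question describes can be checked directly. A small illustration (the toy corpus and helper function below are made up for the example): the same held-out text has a much lower OOV rate under character tokenization than under word tokenization, because unseen words are usually built from seen characters.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens never seen in the training data."""
    vocab = set(train_tokens)
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

# Toy segmented corpora (whitespace marks word boundaries)
train = "我 爱 自然 语言 处理".split()  # "I love natural language processing"
test = "我 爱 处理 语言学".split()      # "I love processing linguistics"

# Word level: the unseen word 语言学 ("linguistics") is an OOV
print(oov_rate(train, test))             # 1/4 = 0.25

# Character level: only the single character 学 is unseen
train_chars = [c for w in train for c in w]
test_chars = [c for w in test for c in w]
print(oov_rate(train_chars, test_chars))  # 1/7, roughly 0.14
```

This is one reason the word model's raw perplexity comes out so much higher; the other is the per-token normalization discussed above.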
