[Corpora-List] Wonky ngrams

Alon Lischinsky alischinsky
Fri Jan 4 13:58:19 CET 2013


On 04/01/13 12:04, Brett Reynolds wrote:


> Can anyone explain why "in spite of" would have a higher frequency than
> "in spite" in the following graph from Google ngrams?
> http://goo.gl/u7J3F


From http://books.google.com/ngrams/info:

"What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are [the bigram sought]?"

In other words, the frequencies are calculated over the total number of n-grams of the same length. Because the denominator differs between bigrams and trigrams, a bigram and a trigram that are expected to have almost identical distributions over the corpus (as in your example) can show slight differences in calculated frequency. Any text contains slightly fewer trigrams than bigrams, so the same raw count yields a slightly higher share for the trigram. (I suspect rounding errors play a role as well.)
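To make the arithmetic concrete, here is a minimal sketch in Python. The toy sentence and the helper name ngram_relative_frequency are my own inventions for illustration, and the real Ngram Viewer computes its totals per year over a far larger corpus, so this only shows the direction of the effect, not Google's actual procedure.

```python
from collections import Counter


def ngram_relative_frequency(tokens, phrase):
    """Share of all n-grams of the phrase's length that match the phrase.

    Hypothetical helper: the denominator is the number of n-grams of the
    same length as the phrase, mirroring the description quoted above.
    """
    target = tuple(phrase.split())
    n = len(target)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return Counter(ngrams)[target] / len(ngrams)


# Made-up toy text in which every "in spite" is followed by "of".
tokens = "she went out in spite of the rain and in spite of the cold".split()

bigram_share = ngram_relative_frequency(tokens, "in spite")
trigram_share = ngram_relative_frequency(tokens, "in spite of")

# Both phrases occur the same number of times, but the text has one fewer
# trigram than bigrams, so the trigram's share comes out slightly higher.
print(f"in spite:    {bigram_share:.4f}")
print(f"in spite of: {trigram_share:.4f}")
```

Both phrases have the same raw count here, yet "in spite of" gets the larger share, purely because it is divided by the smaller trigram total.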

Cheers,

A.


