[Corpora-List] German n-gram files

Christian Pietsch chr.pietsch at googlemail.com
Fri Nov 18 13:55:24 CET 2011


Dear Naohiro,

as an alternative to the Google Web1T n-gram collection Yannick referred to, you might want to look at the Google Books n-gram collection, which also includes a massive German dataset and can be downloaded directly from http://books.google.com/ngrams/datasets .

Besides genre, there are other, more subtle differences between the two data collections, e.g. with respect to tokenization and punctuation handling. In addition to the n-gram and its frequency, the Books n-gram corpus includes the year of publication for each n-gram, so your colleague can filter the n-grams according to his or her definition of “contemporary”.
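
To illustrate the filtering step, here is a minimal Python sketch. It assumes each line of a Books n-gram file is tab-separated with the n-gram first, the year second and the match count third (any further columns are ignored); please check this against the format description on the download page. The function name, the sample lines and the cutoff year 1990 are made up for the example.

```python
from collections import Counter
from typing import Iterable


def sum_counts_since(lines: Iterable[str], min_year: int) -> Counter:
    """Sum match counts per n-gram over all years >= min_year.

    Assumed line layout (not verified here): ngram TAB year TAB match_count [TAB ...]
    """
    totals: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not fit the assumed layout
        ngram = fields[0]
        try:
            year = int(fields[1])
            match_count = int(fields[2])
        except ValueError:
            continue  # skip lines with non-numeric year or count
        if year >= min_year:
            totals[ngram] += match_count
    return totals


# Invented sample lines in the assumed layout, not real corpus data.
sample = [
    "Haus\t1890\t12\t10\t8\n",
    "Haus\t1995\t40\t35\t30\n",
    "Haus\t2005\t55\t50\t41\n",
    "Rechner\t1998\t7\t7\t5\n",
]

totals = sum_counts_since(sample, min_year=1990)
for ngram, count in sorted(totals.items()):
    print(ngram, count)
```

For a real file, an open file handle can be passed instead of the sample list, since the function only iterates over lines. The cutoff year is whatever your colleague takes “contemporary” to mean.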

Regards, Christian

--

Christian Pietsch <http://purl.org/net/pietsch>


