[Corpora-List] Rovereto Twitter N-Gram Corpus

Amac Herdagdelen amac at herdagdelen.com
Mon Jan 9 20:33:10 CET 2012


Dear Corpora List Members,

I'm excited to announce that the Rovereto Twitter N-Gram Corpus (RTC), an n-gram dataset of Twitter messages with gender labels of the authors and time of posting, is publicly available under a CC license. The corpus is based on 75 million English tweets collected from Twitter's public stream between December 2010 and July 2011. Instead of the full text of the tweets, the corpus provides frequency statistics of n-grams. For each n-gram, the frequencies are broken down by the gender of the authors and by posting time (i.e., day of the week and hour of the day). For details, please visit the corpus homepage: http://clic.cimec.unitn.it/amac/twitter_ngram/
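
To give a rough idea of what "frequencies broken down by gender and posting time" means, here is a minimal Python sketch. It is only an illustration: the function names, the (text, gender, timestamp) record layout, and the whitespace tokenization are all assumptions made for this example, and they are not the corpus's actual file format or the pipeline used to build it. Please refer to the homepage for the real format.

```python
from collections import Counter, defaultdict
from datetime import datetime


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tweets, n=2):
    """Count n-grams, keyed by (gender, day of week, hour of day).

    `tweets` is an iterable of (text, gender, posted_at) tuples. This record
    layout is hypothetical and chosen only for the illustration.
    Day of week follows datetime.weekday(): 0 = Monday, 6 = Sunday.
    """
    counts = defaultdict(Counter)
    for text, gender, posted_at in tweets:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        key = (gender, posted_at.weekday(), posted_at.hour)
        for gram in ngrams(tokens, n):
            counts[gram][key] += 1
    return counts


# Made-up sample records, not taken from the corpus.
sample = [
    ("good morning everyone", "female", datetime(2011, 1, 3, 8, 15)),
    ("good morning world", "male", datetime(2011, 1, 3, 8, 40)),
    ("good morning everyone", "female", datetime(2011, 1, 4, 9, 5)),
]

counts = count_ngrams(sample, n=2)
for key, freq in sorted(counts["good morning"].items()):
    print(key, freq)
```

Each n-gram maps to a small table of counts, one cell per combination of gender, weekday, and hour, so the total frequency of an n-gram is simply the sum over its cells.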

Thanks,

Amaç Herdağdelen


