[Corpora-List] social media corpus collected during 2013-2015

Min-Yen Kan knmnyn at gmail.com
Mon Sep 7 16:38:22 CEST 2015

Hi all:

I am one of the maintainers of the aforementioned SMS corpus at NUS. The address given is no longer correct; you may try here (or use your favorite search engine to find it):


In general, social media messages are difficult to share, since many of the platforms strictly state that researchers and other parties may not share messages directly. You may instead be able to obtain post IDs, which you can then pass to the platform's API to retrieve the messages yourself. This is true of Sina Weibo (Chinese) as well as Twitter (English and various other languages), and I gather the same holds for platforms in other languages.
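As a rough illustration of the ID-based workflow described above, here is a minimal Python sketch. It only shows the batching logic: `fetch_batch` is a hypothetical caller-supplied function (not part of any platform's library) that takes a list of post IDs and returns a dict mapping each ID that is still available to its text. The default batch size of 100 is an assumption; check the limits of the API you are using.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` post IDs."""
    batch: List[str] = []
    for post_id in ids:
        batch.append(post_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def rehydrate(
    ids: Iterable[str],
    fetch_batch: Callable[[List[str]], Dict[str, str]],
    batch_size: int = 100,  # assumed limit; varies by platform
) -> Dict[str, str]:
    """Retrieve post texts for a list of shared post IDs.

    `fetch_batch` is a hypothetical wrapper around the platform's API.
    Posts that were deleted or made private are simply absent from
    what it returns, so the result may be smaller than the ID list.
    """
    posts: Dict[str, str] = {}
    for batch in batched(ids, batch_size):
        posts.update(fetch_batch(batch))
    return posts
```

Note that a corpus rebuilt this way can shrink over time, since posts deleted after the IDs were collected can no longer be retrieved.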

<self-plug>Our group has also recently completed a cross-OSN study of online social network (OSN) usage. You might be interested in it -- the corpus (the IDs of the posts) will be made available soon.

Bang Hui Lim, Dongyuan Lu, Tao Chen and Min-Yen Kan (2015). #mytweet via Instagram: Exploring User Behaviour across Multiple Social Networks. To be published in the Proceedings of IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM '15), Paris, France. http://www.comp.nus.edu.sg/~kanmy/papers/asonam15.pdf </self-plug>

By the way, it seems Kayee's message has been pushed out to the list several times, with little variation. I am not sure that is intended, since the list reaches several thousand people.

Cheers,


-- Min-Yen KAN (Dr) :: Associate Professor :: National University of Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive Singapore 117417 :: +65 6516 1885(DID) :: +65 6779 4580 (Fax) :: kanmy at comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)

On Sat, Sep 5, 2015 at 6:46 PM, LEE, kayee [12118192d] <kayee.lee at connect.polyu.hk> wrote:
> Thanks Rob for providing so many useful links. They will be a valuable asset
> to my lexical study, which focuses on new usages, new spellings and
> new words in social media.
> If anyone is doing a study similar to mine, you can go through the links
> provided below.
> I am looking for a few more social media corpora. If you know of any other
> social media corpora collected during 2013-2015 and compiled in English,
> please let me know.
> Thanks.
> Kayee LEE
> ________________________________
> From: rob van der goot <robvanderg at live.nl>
> Sent: August 31, 2015, 12:53 AM
> To: LEE, kayee [12118192d]
> Subject: RE: [Corpora-List] Looking for a social media corpora collected in
> 2013-2015 (Kayee LEE KA LAM)
> Dear Kayee,
> Those files move all the time; here are the updated links:
> Lexnorm (old; from before 2013, I think):
> http://people.eng.unimelb.edu.au/tbaldwin/etc/lexnorm_v1.2.tgz
> Lexnorm 2015 (not in the overview, but newer):
> https://noisy-text.github.io/files/lexnorm2015.tgz
> I think the SMS messages are also from before 2013, but if you are still
> interested:
> http://www.comp.nus.edu.sg/~nlp/corpora.html
> Pos-tagged tweets:
> bit.ly/twitter-bootstrap-corpus
> Another interesting corpus might be the encow corpus (from 2014).
> https://webcorpora.org/
> Or you can always collect your own tweets:
> https://dev.twitter.com/rest/public
> For some of the corpora you do have to contact the creators.
> Good luck with them,
> Rob van der Goot
