[Corpora-List] Building a corpus from Twitter & Tw's privacy concerns

John D. Burger john at mitre.org
Tue Jul 16 17:44:08 CEST 2013


There appears to be no legal reason you can't collect a corpus of tweets. However, per Twitter's Terms of Use you cannot redistribute the tweets to others. A common practice is to instead distribute the tweet IDs, which other people can use to fetch the tweets using Twitter's API. This is how NIST "distributes" the data in their Tweets2011 corpus:

http://trec.nist.gov/data/tweets/
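
As a rough illustration of the ID-based approach described above, here is a minimal Python sketch. It assumes each collected tweet is a dict with an "id" key. The `fetch` argument is a hypothetical stand-in for whatever API client you use: it is assumed to return the tweet for a given ID, or None if the tweet is no longer available. None of these names come from Twitter's API or from the Tweets2011 tools.

```python
from typing import Callable, Dict, Iterable, List, Optional


def write_id_list(tweets: Iterable[Dict], path: str) -> int:
    """Write one tweet ID per line (no text, no metadata); return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for tweet in tweets:
            out.write(f"{tweet['id']}\n")
            count += 1
    return count


def read_id_list(path: str) -> List[str]:
    """Read the distributed ID list back, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def rehydrate(
    ids: Iterable[str],
    fetch: Callable[[str], Optional[Dict]],
) -> Dict[str, Dict]:
    """Rebuild the corpus from IDs using a caller-supplied fetch function.

    `fetch` is hypothetical: it should return the tweet for an ID, or None
    when the tweet cannot be retrieved (for example, because it was deleted).
    """
    recovered = {}
    for tweet_id in ids:
        tweet = fetch(tweet_id)
        if tweet is not None:
            recovered[tweet_id] = tweet
    return recovered
```

The corpus builder would call write_id_list() and share only the resulting file. Recipients would call read_id_list() and then rehydrate() with their own fetch function.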

This is less than optimal for research, though, since in the interim some of the Twitter users may have deleted tweets in the collection. For a sufficiently large corpus, this means that anybody else attempting to use the same data at a later date will almost certainly end up with a subset of your corpus. As far as I know, however, this is currently the only legal method for sharing tweets.
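
One way to make this shrinkage visible is to report how much of the original ID list could still be retrieved. The sketch below is only an illustration; the function names are hypothetical, and it assumes you have the distributed ID list and the IDs you actually managed to fetch.

```python
from typing import Iterable, List


def coverage(original_ids: Iterable[str], recovered_ids: Iterable[str]) -> float:
    """Fraction of the original corpus that could still be retrieved."""
    original = set(original_ids)
    if not original:
        raise ValueError("original ID list is empty")
    kept = original & set(recovered_ids)
    return len(kept) / len(original)


def missing_ids(original_ids: Iterable[str], recovered_ids: Iterable[str]) -> List[str]:
    """IDs from the original list that were not recovered, in original order."""
    recovered = set(recovered_ids)
    return [tweet_id for tweet_id in original_ids if tweet_id not in recovered]
```

Reporting the coverage figure, and ideally the missing IDs, alongside any results lets others judge how comparable two runs on the "same" corpus really are.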

- John Burger

MITRE

On Jul 16, 2013, at 10:51, M.E.Sciubba wrote:


> Dear ListMembers,
>
> I'd like to create a corpus of Italian tweets, but searching online I found out that this is no longer possible because Twitter has changed its privacy settings.
>
> Have any of you tried to build a Twitter corpus, and if so, how?
>
> Any suggestion will be much appreciated (considering that I am not a programmer, though).
>
> Best,
>
> Eleonora
>
>
>
> Dr. Maria Eleonora Sciubba
> Associate Researcher
> Archivio di LInguA Spontanea
> tel. +32 16 3 24795
> cell +32 483 616 114
>
> KU Leuven – Faculty of Arts
>
> Department of French, Italian and Comparative Linguistics
>
> Blijde-Inkomststraat 21, PO BOX 3308
>
> B - 3000 Leuven
>
> http://www.kuleuven.be/wieiswie/nl/person/00088846
