[Corpora-List] Building a corpus from Twitter & Tw's privacy concerns

Miles Osborne miles at inf.ed.ac.uk
Thu Jul 18 10:43:43 CEST 2013


This is a bit of a digression, but it also underlines why building a start-up (which is similar to doing academic Social Media research) on Twitter data is a very risky business. As a community we should try to identify other Social Media streams so that we are not so dependent upon one company.

Adam: Privacy is key, I agree, and it is something that I am working on now.

Mechanisms for distributing data --whilst making guarantees about which information can be inferred from it-- should be the next step. Whether society as a whole allows research using this data is a different question, however, and out of my control.

Miles

On 18 July 2013 09:33, Miguel Almeida <miguelbalmeida at gmail.com> wrote:


> Adam, Miles,
>
> I think another reason is so that Twitter can "black out" everyone else at
> any time in the future. It's a great (and very selfish and narrow-minded)
> idea: let the research community publish papers with your data, showing you
> how to find interesting stuff in your data (using taxpayer money!), and
> then if at some point you want to black them out, use the kill switch.
>
> I don't think Twitter's owners care that much about reproducible research.
> ;)
>
> Miguel
>
>
> On Thu, Jul 18, 2013 at 9:26 AM, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
>
>> Miles,
>>
>> > acts as a barrier to research. Additionally one could argue that
>> > preventing people from having access to static Tweet corpora
>> > undermines doing reproducible research.
>>
>> You can argue all you like but it's a bit irrelevant - the data privacy
>> battleground is the whole wide world, with hi-tech companies, politicians
>> and the media playing for big prizes, and they really won't care one jot
>> what us worker ants think (or if they trample us).
>>
>> adam
>>
>> On 18 July 2013 08:55, Miles Osborne <miles at inf.ed.ac.uk> wrote:
>>
>>> Basically Twitter's insistence on distributing IDs and not raw Tweets
>>> stems from the fact that third parties need to honour deletion requests.
>>>
>>> If you pass around raw Tweets then there is no way for Twitter to argue
>>> that a deleted Tweet is deleted. If instead you force people to recrawl
>>> them each time then Tweets can be deleted at source and all subsequent
>>> access requests will not return that deleted Tweet.
>>>
>>> Personally I think this way of distributing Tweets in bulk is not
>>> scalable and acts as a barrier to research. Additionally one could argue
>>> that preventing people from having access to static Tweet corpora
>>> undermines doing reproducible research.
>>>
>>> Miles
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>>
>>
>>
>> --
>> ========================================
>> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
>> adam at lexmasterclass.com
>> Director Lexical Computing Ltd <http://www.sketchengine.co.uk/>
>>
>> Visiting Research Fellow University of Leeds <http://leeds.ac.uk>
>>
>> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
>>
>> *DANTE: a lexical database for English* <http://www.webdante.com>
>> ========================================
>
