[Corpora-List] Wikileaks Pager Corpus

Trevor Jenkins trevor.jenkins at suneidesis.com
Tue Apr 12 11:41:35 CEST 2011


Having mentioned, in a response to Laura's request for text messages, the Wikileaks 9/11 pager corpus and how difficult it can be to locate, I went in search of it.

This page http://mirror.wikileaks.info/wiki/911/ will get you to the data.

The data is segmented into separate files of 5-minute time slices(*). Coverage is said to be continuous from 3AM on 11th September to 3AM on the 12th. I have not checked the sequence myself. There are approximately half a million texts. The exact number is disputed.
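For anyone who does want to check the sequence, a minimal sketch follows. It assumes you have already worked out the start time of each slice file you hold (the file naming scheme is not described here, so that step is left out), and it treats the times as naive local clock times. The names `expected_slices` and `missing_slices` are my own, not part of the release.

```python
from datetime import datetime, timedelta

# Stated coverage: 3AM on 11th September to 3AM on the 12th, in 5-minute slices.
START = datetime(2001, 9, 11, 3, 0)
END = datetime(2001, 9, 12, 3, 0)
STEP = timedelta(minutes=5)


def expected_slices(start=START, end=END, step=STEP):
    """Return the start time of every slice in the half-open range [start, end)."""
    slices = []
    current = start
    while current < end:
        slices.append(current)
        current += step
    return slices


def missing_slices(present):
    """Return the expected slice start times that are absent from `present`."""
    seen = set(present)
    return [s for s in expected_slices() if s not in seen]
```

If the coverage really is continuous, `missing_slices` returns an empty list when given the start times of all the files.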

Each message is a single line of text. The format takes some getting used to but basically it is date (in ISO format), time, service operator, pager number, code(s) that identify the message content/encoding, followed by the message itself. The codes vary a little between the operators but are not difficult to unravel.
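As a rough illustration of that field order, here is a minimal parsing sketch. Several details are assumptions on my part rather than facts about the data: that fields are separated by whitespace, that the operator is a single token, that the pager number is a run of digits (optionally in square brackets), and that exactly two code tokens precede the message. Since the codes vary between operators, the pattern will likely need adjusting. The sample line is invented.

```python
import re
from typing import NamedTuple, Optional


class PagerMessage(NamedTuple):
    date: str
    time: str
    operator: str
    pager: str
    codes: str
    message: str


# ASSUMPTION: whitespace-separated fields, two code tokens before the message.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<operator>\S+)\s+"
    r"\[?(?P<pager>\d+)\]?\s+"
    r"(?P<codes>\S+\s+\S+)"
    r"(?:\s+(?P<message>.*))?$"
)


def parse_line(line: str) -> Optional[PagerMessage]:
    """Parse one line; return None if it does not fit the assumed layout."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    fields = match.groupdict()
    # A line may carry codes but no message text at all.
    fields["message"] = fields["message"] or ""
    return PagerMessage(**fields)


# Invented sample line, for illustration only.
SAMPLE = "2001-09-11 08:46:40 ExampleOp [0001234] A ALPHA Call the desk"
```

Returning `None` for lines that do not match, rather than raising, makes it easy to count and inspect the lines the pattern fails on.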

There is a reddit page linked to the above. Several commentators detail the format. A few ``conversations'' are highlighted there too. Selecting the actual messages by pager number will show these clearly. There's also the usual speculation and conspiracy drizzle.
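Selecting by pager number can be sketched as below. This assumes the messages have already been reduced to (timestamp, pager number, text) tuples with ISO-format timestamps, which sort chronologically as plain strings. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_pager(messages):
    """Map each pager number to its messages in chronological order."""
    threads = defaultdict(list)
    for timestamp, pager, text in messages:
        threads[pager].append((timestamp, text))
    for entries in threads.values():
        # ISO timestamps sort correctly as strings.
        entries.sort(key=lambda entry: entry[0])
    return dict(threads)


# Invented sample data, for illustration only.
SAMPLE_MESSAGES = [
    ("2001-09-11 09:05:00", "0001234", "Are you ok?"),
    ("2001-09-11 09:01:00", "0005678", "Server down"),
    ("2001-09-11 09:02:00", "0001234", "Call home"),
]
```

Pagers with many messages close together in time are the likeliest candidates for the highlighted exchanges.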

The actual message content varies. Some are automated status messages about trading systems that have gone offline. Some are news reports of other suspicious activities. Some are not in English; I spotted several in Spanish. Some are encoded; a few appear to be (weakly) encrypted; a lot more are quasi-MIME-encoded binary data. Some are in plain text that should have been encrypted; there are messages carrying national security intel to pagers that have been traced back to the FBI, US Secret Service and similar agencies. Some are personal. Some are from lovers conducting affairs. Some are just plain weird.

Individual messages can appear incomplete, and if you're processing the data be careful of unmatched quote characters. A couple of reddit contributors provided awk scripts to process the messages into CSV format. Another created an SQLite dump. I have not checked that the awk scripts work properly, nor have I checked that the SQLite dump still exists or has said content.
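This is not a reconstruction of those awk scripts, but one way to sidestep the quoting problem is to let a CSV library do the escaping. The sketch below uses Python's standard `csv` module and quotes every field, so a stray quote inside a message cannot break a row. The column names and sample rows are my own invention.

```python
import csv
import io

HEADER = ["date", "time", "operator", "pager", "codes", "message"]


def rows_to_csv(rows):
    """Serialise already-split message fields to CSV text, quoting every field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented rows containing unmatched quote characters.
SAMPLE_ROWS = [
    ["2001-09-11", "09:02:00", "ExampleOp", "0001234", "A ALPHA", 'He said "call me'],
    ["2001-09-11", "09:05:00", "ExampleOp", "0001234", "A ALPHA", "it's late, ring home"],
]
```

Reading the result back with `csv.reader` should return the original fields unchanged, which is a quick sanity check on any conversion.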

Another of the reddit commentators links to their own blog describing how, with a radio scanner and some simple hardware, it is possible to scrape the airwaves for current pager messages. The ethics of doing so are suspect. And in the UK, where I am, such activity would be in breach of the Telecommunications Act 1949(?) which prohibits the interception of communications. There are similar pages in the blogosphere that document how to intercept SMS messages in a similar fashion. But with the on-going News International debacle over cracking of voicemail messages by journalists at the News of the World, cracking mobile phone transmissions for SMS content is probably not a good idea.

(*) The original release was done in real-time with each batch of messages made available at the same clock time as on the day.

As a possible aside, I see that the Singapore SMS corpus, which was also mentioned in reply to Laura's enquiry, includes meta-data on the model of phone being used. Comparing the style of some of those messages with the Wikileaks pager messages, I wondered whether, for the SMS corpus, there is any apparent stylistic difference in the content based solely upon the device being used. There is no such meta-data for the pager messages other than the service provider name.

Regards, Trevor

<>< Re: deemed!


