[Corpora-List] Call for SMS Contribution for a Public Research Corpus

Tao Chen taochen at comp.nus.edu.sg
Tue Apr 5 15:53:06 CEST 2011


Dear members of the corpora community:

We are seeking your help to enlarge a freely available public corpus of SMS messages. Over the last few months, we at the National University of Singapore (NUS) have been collecting a live corpus of SMS (Short Message Service) messages. Previously, in 2004, we made a corpus of about 10,000 messages (in English, mostly from Singaporeans) available to the public for study.

We restarted the 2004 project last October, aiming to enlarge the corpus in both depth and breadth. We are collecting better demographic information, timestamps, and recipient and sender identity (appropriately anonymized), and including these with the corpus' messages. So far, we have collected over 21,000 new English messages and 10,000 Chinese messages. Most messages are tagged with metadata about the sender's profile (gender, age, country, years of using SMS, number of SMS sent daily, etc.). Because the collection process is live and the corpus is growing, the corpus is versioned and released on a monthly basis, and it is free for all communities to use. For detailed information about our corpus, please visit our NUS SMS Corpus site at: http://wing.comp.nus.edu.sg:8080/SMSCorpus.

We are writing to ask for your help, either directly or indirectly, in building this public resource. SMS remains a vital and sensitive vehicle for personal communication that many of us use daily. To date, scholars have not had access to a large, freely available SMS corpus to study, and most research on SMS has been done in collaboration with private companies that impose strict non-disclosure agreements, making comparative SMS research impossible.

As SMS messages are potentially sensitive and identity-revealing, our collection framework tries to anonymize sensitive data in messages, such as telephone numbers, email addresses and other identifiers, before accepting them into the corpus. This is a legitimate attempt to collect and enlarge an SMS corpus for the public good, and if you are concerned about the legitimacy of our project, please visit our webpage first. Additionally, this study has been exempted from NUS' institutional review board (IRB) panel for human studies protocols.

Such a public corpus needs your contribution, as most of us are senders of SMS. With a larger base of contributors and a growing number of messages archived, the corpus will grow in depth and utility to scholars everywhere.

Currently, there are three methods for you to contribute SMS messages to the public corpus. Please refer to the "Contribution" page from our project page at http://wing.comp.nus.edu.sg:8080/SMSCorpus/ for detailed information. We summarize them below.

* Android phone owners - Please install our app "SMS Collection for Corpus" from the Android Market (authored by Web IR / NLP Group @ NUS). Follow the app's instructions to submit SMS messages to us. The software will create a draft message containing your SMS messages to send to us; you will have a chance to censor or delete messages that you do not want to contribute.

* Nokia phone owners - Please use Nokia PC Suite to export your SMS messages as a CSV file. The PC Suite software is available from our project page. Then send the file to SMS.Donation at gmail.com.

* Owners of other phone brands - You can type your messages into the contribution site's web page. Alternatively, if you know of software that can export your SMS messages as a file (e.g. a CSV file), export them and then send the file to SMS.Donation at gmail.com.

(We currently do not have an automated donation method for the iPhone, sorry!)

If you have any questions or suggestions, please feel free to contact me. We sincerely appreciate your suggestions and contributions!

-- Tao CHEN

PhD Candidate
Web IR / NLP Group (WING), School of Computing
National University of Singapore


