[Corpora-List] text message corpus - clarification

Christopherson, Laura llchrist at email.unc.edu
Mon Apr 11 20:41:36 CEST 2011

Hi All,

I am responding to the questions raised about my earlier request for an English-only text messaging corpus. Thanks so much for reading and responding. I definitely need to be more specific!

A couple of points were raised about the notions of "text messages" and "personal." I will try to clarify these points.

When I used the term "text messages," I meant it in a specific way (not the general sense of "things/documents/files in text"). Specifically, I meant SMS (Short Message Service), as Benjamin indicated - messages created on cellphones and sent via a service provider's (like AT&T) messaging service.

Regarding the "personal" idea, absolutely yes - ultimately each message is personal to someone. What I'm looking for is a collection of text messages that are not personal **to the collector** - i.e. not the collector's own messages to/from his family/friends, and not messages created only by the collector's family/friends. For instance, Caroline Tagg has an awesome corpus of SMS messages; but with the exception of a small subset of that corpus, all messages are from people she knows personally (family/friends). On the opposite tack, there is the NUS SMS corpus that was recommended by John. As I understand it, this corpus was created by having students (not necessarily personal friends/family of the collector) submit messages to the collector. So I consider this "non-personal." (Does this make sense?)

While the NUS SMS corpus satisfies the "non-personal" requirement, it doesn't satisfy the English-only requirement. I had originally intended to use it, but when I got into it, I realized I could not because there is so much code-switching, even within a single message. I don't speak any of the other languages used in Singapore and would be at a loss to make solid distinctions between Netspeak (see David Crystal: Language and the Internet) terms in English, Netspeak terms in some other language, or non-Netspeak terms in a non-English language. Sigh - because it too is a wonderful corpus.

Susan's corpus and Trevor's suggestion of WikiLeaks may be right on target for me, if what I've (hopefully) clarified here matches your (Susan's and Trevor's) understanding of these text messages.

I really appreciate your help with this!


