[Corpora-List] text message corpus - clarification

Trevor Jenkins trevor.jenkins at suneidesis.com
Tue Apr 12 01:22:09 CEST 2011


On Mon, 11 Apr 2011, Christopherson, Laura <llchrist at email.unc.edu> wrote:


> Regarding the "personal" idea, absolutely yes - ultimately each message is
> personal to someone. I'm more interested in text messages that are not a
> collection of messages which are personal **to the collector** - i.e. not
> the collector's own messages to/from his family/friends or messages that
> are created by only the collector's family/friends. For instance, Caroline
> Tagg has an awesome corpus of SMS messages; but with the exception of a
> small subset of that corpus, all messages are from people she knows
> personally (family/friends). ...

But if you had enough of these individual collections to work with, the implicit bias would disappear, presuming little overlap between the senders and recipients. Defining ``enough'' will be hard. Would 20, 200, 2,000, or 20,000 different collections be sufficient?

And then there are the demographics of the people. Amongst friends who text me I see a variety of styles based solely on demographics. Older senders are more likely to write their messages out in full, younger ones to use l33t or txt spk abbreviations. Messages to my phone also have very different content depending upon whether the originator is Deaf or not. (I work as a community sign language interpreter.) The Deaf senders tend toward brevity but without using l33t/txt spk conventions; the hearing senders will play with homophonic abbreviations like HOW R U? and C U L8R.


> Susan's corpus and Trevor's suggestion of wikileaks may be right on target
> for me if what I've hopefully clarified gels with your (Susan and
> Trevor's) understanding of these text messages.

The WikiLeaks collection is variously described as SMS or pager messages. Both are short messaging systems, but a major difference is that SMS can be bi-directional whereas pager messages are uni-directional. Reconstructing SMS dialogues might be difficult unless you had a stringent collection protocol.

You may have to search for the WikiLeaks material as a) their website gets overloaded, b) it gets hit with DDoS attacks by disaffected and disgruntled objectors to their activity, and c) the pager material is buried several links down.

Regards, Trevor

<>< Re: deemed!


