[Corpora-List] Community-driven corpus building

Martin Reynaert reynaert at uvt.nl
Sat Apr 16 16:24:48 CEST 2011

Dear List,

With all these pun(n)y red herrings out of the way (thank you, Ramesh!), we can now perhaps return to this thread's topic: the building of shareable corpora.

Shareability implies IPR settlement. Where site maintainers include the necessary statements in their terms of use, users implicitly consent to redistribution of their 'donated' texts. This allows the corpus builder to deal with a single instance rather than with potentially thousands of unreachable individuals.

This has allowed SoNaR to draw up a licence agreement with the maintainers of a single internet forum which alone has yielded text far in excess of the envisaged 500 million words for the whole Dutch reference corpus.

We are nevertheless trying to build a balanced corpus of almost 40 different text types, from books to tweets, from two countries (each with its own copyright legislation).

This week in a posting on this list from Singapore, we have heard about interesting approaches to collecting SMS.

I am very interested in hearing from other members of this list about other viable, preferably large-scale, approaches to collecting IPR-settled text, regardless of text type, for the purposes of building shareable corpora.

Thank you,


On 4/16/11 1:42 PM, Trevor Jenkins wrote:
> Hi Ramesh,
>>> 1. "I mung headers.... I know that as a result of my munging any
>>> message I write will go to the list (and only to the list). And then I
>>> *only* have to consider the few instances where I would need to reply
>>> off list."
>> I tried to find the meaning of 'mung' via Google - 'mash until no good' =
>> destroy? So I can't understand your use of it. If you destroy headers,
>> how does that ensure you only reply to the list?
> For some email consultants munging is considered bad. Chip Rosenthal wrote
> an essay entitled ``Munging Headers Considered Harmful'' (*) there is a
> copy at http://marc.merlins.org/netrants/reply-to-harmful.html (or so
> Google tells me).
> I happen to disagree vehemently with Rosenthal and his ilk. Setting
> Reply-To: to be the list is the only proper way to set up a mailing
> list. If the owners won't do it, then I do it with a procmail recipe.
> (*) This is an in-joke for computing scientists relating back to Edsger
> Dijkstra's (in)famous March 1968 letter to Comm of the ACM entitled ``GOTO
> Considered Harmful''.
>>> 2. "Except that you have actually donated your on-list replies to a
>>> collection that is not under the control of the list owners. It can be
>>> scrapped by anyone with a mind to. This list (and many others) is
>>> being mirrored on gmane.org."
> Interesting that when I searched for ``mung headers'' to find the
> Rosenthal paper the first thing that Google found was my post to CORPORA-L
> scraped to gmane.org.
>> What does 'scrapped' mean in this context?
> You're asking a dyslexic what a spelling means. ;-) Ah Google's
> define: tells me it should be ``scraped''.
> Ironic isn't it that a dyslexic is passionate about corpus studies. And a
> dyslexic who used to work on the world's best text retrieval system (Trip
> originally from paralog). ;-)
>>> 3. By the way, I have long been intrigued by your strapline (if that's
>>> the appropriate term):
>> "<>< Re: deemed!"
> Interestingly in the 20+ years I've been emailing, you (Ramesh) are only
> the fourth or fifth person ever to comment. And of them only the
> second/third to ask what it means. Two replied ``me too.'' ;-)
> Two parts: <>< is an ASCII art representation of the 1st century symbol
> ICTHUS used within the Christian church to identify themselves to other
> believers. One wonders what a corpus of ASCII art would look like and how
> would it be analysed?
> ``Re: deemed!'' is a pun on the word redeemed. Also makes for an
> interesting question of corpus analysis.
> Regards, Trevor
> <>< Re: deemed!
