With all these pun(n)y red herrings out of the way (thank you, Ramesh!), we can now perhaps return to this thread's topic: the building of shareable corpora.
This has allowed SoNaR to draw up a licence agreement with the maintainers of a single internet forum which alone has yielded text far in excess of the envisaged 500 million words for the whole Dutch reference corpus.
We are nevertheless trying to build a balanced corpus of almost 40 different text types, from books to tweets, from two countries (each with its own copyright legislation).
This week, in a posting to this list from Singapore, we heard about interesting approaches to collecting SMS messages.
I am very interested in hearing from other members of this list about other viable, preferably large-scale, approaches to collecting IPR-settled text, regardless of text type, for the purposes of building shareable corpora.
On 4/16/11 1:42 PM, Trevor Jenkins wrote:
> Hi Ramesh,
>>> 1. "I mung headers.... I know that as a result of my munging any
>>> message I write will go to the list (and only to the list). And then I
>>> *only* have to consider the few instances where I would need to reply
>>> off list."
>> I tried to find the meaning of 'mung' via Google - 'mash until no good' =
>> destroy? So I can't understand your use of it. If you destroy headers,
>> how does that ensure you only reply to the list?
> For some email consultants munging is considered bad. Chip Rosenthal wrote
> an essay entitled ``Munging Headers Considered Harmful'' (*); there is a
> copy at http://marc.merlins.org/netrants/reply-to-harmful.html (or so
> Google tells me).
> I happen to disagree vehemently with Rosenthal and his ilk. Setting
> Reply-To: to be the list is the only proper way to set up a mailing
> list. If the owners won't do it then I do it with a procmail recipe.
> (*) This is an in-joke for computing scientists relating back to Edsger
> Dijkstra's (in)famous March 1968 letter to Comm. of the ACM entitled ``Go To
> Statement Considered Harmful''.
>>> 2. "Except that you have actually donated your on-list replies to a
>>> collection that is not under the control of the list owners. It can be
>>> scrapped by anyone with a mind to. This list (and many others) is
>>> being mirrored on gmane.org."
> Interesting that when I searched for ``mung headers'' to find the
> Rosenthal paper the first thing that Google found was my post to CORPORA-L
> scraped to gmane.org.
>> What does 'scrapped' mean in this context?
> You're asking a dyslexic what a spelling means. ;-) Ah Google's
> define: tells me it should be ``scraped''.
> Ironic, isn't it, that a dyslexic is passionate about corpus studies. And a
> dyslexic who used to work on the world's best text retrieval system (Trip,
> originally from Paralog). ;-)
>>> 3. By the way, I have long been intrigued by your strapline (if that's
>>> the appropriate term):
>> "<>< Re: deemed!"
> Interestingly in the 20+ years I've been emailing, you (Ramesh) are only
> the fourth or fifth person ever to comment. And of them only the
> second/third to ask what it means. Two replied ``me too.'' ;-)
> Two parts: <>< is an ASCII art representation of the 1st century symbol
> ICHTHUS used by Christians to identify themselves to other
> believers. One wonders what a corpus of ASCII art would look like and how
> it would be analysed.
> ``Re: deemed!'' is a pun on the word redeemed. Also makes for an
> interesting question of corpus analysis.
> Regards, Trevor
> <>< Re: deemed!
> Corpora mailing list
> Corpora at uib.no