[Corpora-List] All English Text Messaging Corpus?

Khurshid Ahmad kahmad at scss.tcd.ie
Mon Apr 11 15:43:34 CEST 2011


Dear Laura,

I am writing to support Rich.

The USPTO documents are in Legal English and are written by patent attorneys; these documents form a representative sample of American English and, to a lesser extent, of other national varieties of written English. There is a distinction between patent claims and granted patents. For authenticity, I prefer the granted patents, as these documents have been reviewed by more than one person. I think the USPTO allows you to make that distinction in its retrieval engine.

Yes, there has been a proliferation of such documents, and there may be some laxity by some attorneys in some documents. Large corporations like IBM and Google file these patents, and the assignees of the patents include the US Defence Forces. I infer from the named entities on these documents that some care and attention has been paid to the legal arguments presented in them; and apart from the diagrams in the patent documents, we have written English. The whole point of corpus linguistics is that some texts within a collection will be outliers of that collection. Literary critics will not allow for outliers, but dictionary makers and information extraction folk love the outliers.

I think it is an excellent idea to be so focussed on building a corpus. The world, and its scholars, are so fixated on newspaper and newswire texts that any other variety is seldom considered.

Good luck


> Hi Laura,
>
> I don't know of any text message sources exactly like what you are
> describing. But there is a huge, partially structured text database for
> US patent documents, nearly all in English I suppose, which have all been
> critiqued by expert examiners, as edited in the process of negotiating a
> patent claim set - all in English. You can create databases of patent
> documents on your desktop by downloading the free web client software Elk
> for Patents (EfP), which is built on the English Logic Kernel (Elk), as
> described in US Patent 7,209,923. The patent is posted on the web site as
> well. It teaches ways to combine corpus analysis methods with relational
> and object-oriented database technologies. See my website to download and
> try the free program.
>
> EnglishLogicKernel dot com
>
> One advantage of choosing the patent database is that every document is
> constrained by the patenting process by experts in each patent's specific
> technologies, and the vocabulary of words defined modus ponens after
> careful debate and crafting of each claim sentence. For example, no really
> effective syntax parser for English has reached widespread usage, with the
> best of the performers being the Link Grammar Processor (LGP), IMHO.
> Using the vocabulary of non-noise words defined in patent claims, the
> English analyst can relate those claim words and phrases to specific
> objects as they have been described by sentences in the much more verbose
> specification part of the patent document. This provides an ideal, large,
> partially structured database and processing environment in which to
> analyze the English of claim language.
>
> HTH,
> -Rich
>
>
> Sincerely,
> Rich Cooper
> EnglishLogicKernel.com
> Rich AT EnglishLogicKernel DOT com
> 9 4 9 \ 5 2 5 - 5 7 1 2
>
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> Christopherson, Laura
> Sent: Saturday, April 09, 2011 12:35 PM
> To: corpora at uib.no
> Subject: [Corpora-List] All English Text Messaging Corpus?
>
> Hi All,
>
> Do any of you know of a text messaging corpus only in English that is not
> a collection of someone's personal (and/or family/friends') messages?
>
> Thanks,
>


