[Corpora-List] User generated content corpora

Adam Kilgarriff adam.kilgarriff at sketchengine.co.uk
Thu Mar 5 10:50:46 CET 2015


I'm reminded of the quip about Natural Language Generation:

"Natural language analysis is like counting from 1 to infinity. The trouble with Natural Language Generation is that it is like counting from infinity to 1." (which I first heard from Yorick Wilks, but he had heard it from someone else.)

It's like the 'normalisation' discussion because in both cases we dream of a 'true message' that, in the NLG case, we generate from, and in the normalisation and analysis cases, we normalise/analyse to. Of course the 'true message' is a fantasy, and what we think it should be will depend on our particular goals in our particular context.

Adam

On 5 March 2015 at 08:58, Min-Yen Kan <knmnyn at gmail.com> wrote:


> Hi Rob, all:
>
> You found the SMS corpus that I collected, and that others have added
> normalization and correction for. For user generated content in
> student essay form, there's also the NUCLE corpus, collected by my
> colleague at NUS, Prof. Hwee Tou Ng. It's described in the following
> paper:
>
> http://www.aclweb.org/anthology/W13-1703
>
> To be a bit more constructive, I've supplied the abstract below:
>
> We describe the NUS Corpus of Learner English (NUCLE), a large, fully
> annotated corpus of learner English that is freely available for
> research purposes. The goal of the corpus is to provide a large data
> resource for the development and evaluation of grammatical error
> correction systems. Although NUCLE has been available for almost two
> years, there has been no reference paper that describes the corpus in
> detail. In this paper, we address this need. We describe the
> annotation schema and the data collection and annotation process of
> NUCLE. Most importantly, we report on an unpublished study of
> annotator agreement for grammatical error correction. Finally, we
> present statistics on the distribution of grammatical errors in the
> NUCLE corpus.
>
> Cheers,
>
> Min
>
> --
> Min-Yen KAN (Dr) :: Associate Professor :: National University of
> Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive
> Singapore 117417 :: +65 6516 1885(DID) :: +65 6779 4580 (Fax) ::
> kanmy at comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)
>
>
> On Thu, Mar 5, 2015 at 4:37 PM, Grzegorz Chrupała <pitekus at gmail.com>
> wrote:
> > On Thu, Mar 5, 2015 at 12:20 AM, Jacob Eisenstein <jacobe at gmail.com> wrote:
> >>> - Lexnorm: fairly small Twitter corpus, but includes corrections.
> >>
> >> It's an open question whether the whole exercise of normalization really
> >> makes sense for these sorts of terms, since the meaning of the normalized
> >> version is often quite different from the original: you can't use the
> >> phrase "shake my head" in the same linguistic contexts where you can use
> >> "smh", even if the latter is in some sense an abbreviation for the former.
> >
> > This is a very good point, and for this reason I suggest that we think
> > of normalization as an annotation layer on top of the original text,
> > rather than simply replacing it. To a lesser degree the same applies
> > to word and sentence segmentation: I think it is unfortunate that in
> > NLP there is a tradition of working with *destructively* preprocessed
> > corpora, where it is impossible to recover the intact original text.
> >
> > --
> > Grzegorz Chrupała
> > Communication and Information Sciences
> > Tilburg University
> > PO Box 90153
> > 5000 LE Tilburg
> > The Netherlands
> >
> > Web: grzegorz.chrupala.me
> > Phone: +31 13 466 3106
> > Email: g.chrupala at uvt.nl
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>
>
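
The suggestion quoted above, to keep normalization as an annotation layer on top of the original text rather than replacing it, could look roughly like the following. This is only a minimal sketch: the names Normalization and apply_layer, the character-offset scheme and the example message are all assumptions made for illustration, not taken from Lexnorm or any other corpus mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Normalization:
    """One stand-off annotation (hypothetical format, for illustration only)."""
    start: int       # character offset into the original text (inclusive)
    end: int         # character offset into the original text (exclusive)
    normalized: str  # proposed normalized form for original[start:end]


def apply_layer(text, layer):
    """Build a normalized view of `text`; the original string is never modified."""
    pieces = []
    pos = 0
    for ann in sorted(layer, key=lambda a: a.start):
        if ann.start < pos:
            raise ValueError("overlapping annotations")
        pieces.append(text[pos:ann.start])  # untouched original material
        pieces.append(ann.normalized)       # normalized replacement
        pos = ann.end
    pieces.append(text[pos:])
    return "".join(pieces)


# Invented example message; the offsets point back into the original string.
original = "smh u cant be serious"
layer = [
    Normalization(0, 3, "shake my head"),
    Normalization(4, 5, "you"),
    Normalization(6, 10, "can't"),
]

normalized_view = apply_layer(original, layer)
```

Because the layer only stores offsets and replacements, the intact original is always recoverable, and a different layer (or none) can be applied for a different purpose.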

-- 
=============================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at sketchengine.co.uk
Director  Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow  University of Leeds <http://leeds.ac.uk/>
Blog <http://blog.kilgarriff.co.uk> at blog.kilgarriff.co.uk
*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk/>
and SKELL <http://skell.sketchengine.co.uk/>
=============================================


