[Corpora-List] tagging+lemmatizing various languages

Andrew Caines cainesap at gmail.com
Mon Feb 2 09:07:07 CET 2015


Some examples of previous attempts to adapt taggers to specific domains: biomedical (Rimell & Clark 2009 <http://dl.acm.org/citation.cfm?id=1628462>), Twitter (Plank, Hovy, McDonald & Søgaard 2014 <http://www.aclweb.org/anthology/C/C14/C14-1168.pdf>), learner language (Cahill, Gyawali & Bruno 2014 <http://www.aclweb.org/anthology/W/W14/W14-6106.pdf>). A wider survey would be useful here: contributions from the list?

As for new training data, the arXiv paper by Chelba et al. <http://arxiv.org/abs/1312.3005>, 'One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling', may be of interest, along with work by the Web As Corpus SIG: http://www.aclweb.org/anthology/sigwac.html

But as Adam says, there are many more issues here, including Anglo-centrism (note the Universal Dependencies <http://universaldependencies.github.io/docs/> project), representativeness / sample bias (see Temnikova et al. 2014 <http://www.lrec-conf.org/proceedings/lrec2014/pdf/675_Paper.pdf>, Hovy, Plank & Søgaard 2014 <http://www.lrec-conf.org/proceedings/lrec2014/pdf/476_Paper.pdf>), and so on.
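Ciarán's "bit" example can be reproduced with a toy most-frequent-tag baseline. The mini-corpora below are invented for illustration (they are not real WSJ or BNC data), but they show the mechanism: a unigram tagger simply inherits whichever tag dominated its training text, so a finance-flavoured model calls "bit" a noun while a general-text model calls it a past-tense verb.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpora, tagged with Penn Treebank tags.
# In finance-flavoured text "bit" occurs as a noun ("a bit of growth");
# in general text it occurs as a past-tense verb ("the dog bit him").
financial = [("a", "DT"), ("bit", "NN"), ("of", "IN"), ("growth", "NN"),
             ("every", "DT"), ("bit", "NN"), ("helps", "VBZ")]
general = [("the", "DT"), ("dog", "NN"), ("bit", "VBD"), ("him", "PRP"),
           ("she", "PRP"), ("bit", "VBD"), ("back", "RB")]

def train_unigram_tagger(tagged_words):
    """Most-frequent-tag baseline: map each word to its commonest tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

wsj_style = train_unigram_tagger(financial)
general_style = train_unigram_tagger(general)

print(wsj_style["bit"])      # NN  -- noun, as in "a bit of growth"
print(general_style["bit"])  # VBD -- verb, as in "the dog bit him"
```

Real taggers use context, of course, but their lexical priors come from the same place, which is why the choice of training corpus matters as much as the algorithm.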

Andrew

On 2 February 2015 at 04:15, Adam Kilgarriff <adam.kilgarriff at sketchengine.co.uk> wrote:


> > surely the adequacy of the training data must be as important a factor
> as the algorithm.
>
> Absolutely! Well said, Ciarán. I often think, if NLP people put as much
> effort into the training data as the algorithms, our systems would perform
> much better.
>
> There's so much more I could say on the subject (on the problem of
> language-specific work in Computer Science, on what's open and what's
> proprietary, on generalising across text types, on statistical vs
> rule-based systems and in-built biases in favour of statistical ones, when
> we compare performance figures) - but the post would quickly get much too
> long!
>
> Adam
>
>
> On 1 February 2015 at 17:26, Ciarán Ó Duibhín <
> ciaran at oduibhin.freeserve.co.uk> wrote:
>
>> Thanks to Alexandr for a very interesting survey.
>>
>> Tagging programs (algorithms) for a given language are often compared,
>> but surely the adequacy of the training data must be as important a factor
>> as the algorithm. As far as I can see, most available models for English
>> are trained on the Wall Street Journal, which is a rather restrictive
>> domain. I use such a model myself when tagging "general" English text, and
>> unsurprisingly it mistags words in senses that would have been rare in the
>> training domain.
>>
>> Take as a simple example "The dog bit the postman." A WSJ-trained model
>> is likely to tag "bit" as a noun. Likewise, words such as "round" or
>> "crude" are likely to be tagged as nouns rather than as the adjectives that
>> are more common in general text.
>>
>> The tagger I use, and its supplied WSJ model, have been around for years,
>> and it surprises me that little effort is being made (that I am aware of)
>> to improve the model. I regularly see this tagger being used on large
>> corpora of English, and while the training model is not mentioned I would
>> assume it is the supplied WSJ model, and I doubt whether extensive manual
>> post-editing follows.
>>
>> Comments anyone? Perhaps there are taggers which have a more
>> "domain-independent" model of English than that provided by WSJ? Is there
>> a survey of this aspect?
>>
>> Ciarán Ó Duibhín.
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
>
> --
> =============================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at sketchengine.co.uk
> Director Lexical Computing Ltd
> <http://www.sketchengine.co.uk/>
> Visiting Research Fellow University of Leeds
> <http://leeds.ac.uk/>
> Blog <http://blog.kilgarriff.co.uk> at blog.kilgarriff.co.uk
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk/>
> and SKELL <http://skell.sketchengine.co.uk/>
> =============================================
>