[Corpora-List] tagging+lemmatizing various languages

Alexandr Rosen alexandr.rosen at gmail.com
Tue Feb 3 21:38:11 CET 2015


Thanks a lot for the comments and suggestions!

Actually, each cell with a tick in the summary table at https://wiki.korpus.cz/doku.php/seznamy:tagery should also include information on the size and domain of the training data. But languages with rich inflection require more training data, or else the tagger must include a morphological component and a lexicon; then the size of the lexicon matters, too. Maybe at least some cells could be filled in.

Alexandr Rosen

---

Date: Sun, 1 Feb 2015 17:26:58 -0000
From: Ciarán Ó Duibhín <ciaran at oduibhin.freeserve.co.uk>
Subject: Re: [Corpora-List] tagging+lemmatizing various languages
To: <corpora at uib.no>

Thanks to Alexandr for a very interesting survey.

Tagging programs (algorithms) for a given language are often compared, but surely the adequacy of the training data must be as important a factor as the algorithm. As far as I can see, most available models for English are trained on the Wall Street Journal, which is a rather restricted domain. I use such a model myself when tagging "general" English text, and unsurprisingly it mistags uses which would have been rare in the training domain.

Take as a simple example "The dog bit the postman." A WSJ-trained model is likely to tag "bit" as a noun rather than a past-tense verb. Likewise, words such as "round" or "crude" are likely to be tagged as nouns rather than as the adjectives that are more common in general text.
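A quick way to check how any particular tagger handles such cases is simply to run it on the sentences and inspect the output. The snippet below is a minimal sketch, assuming NLTK's off-the-shelf English model purely for illustration (not necessarily the tagger or model I use); any tagger with a similar interface could be substituted.

    # Minimal sketch: probe an off-the-shelf tagger on domain-sensitive words.
    # NLTK's default English model is assumed here purely for illustration.
    import nltk

    # One-time model downloads (cached after the first run).
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentences = [
        "The dog bit the postman.",                 # "bit" should come out as VBD, not NN
        "He drew a round table in crude strokes.",  # "round" and "crude" should be JJ
    ]

    for sent in sentences:
        tokens = nltk.word_tokenize(sent)
        print(nltk.pos_tag(tokens))  # list of (token, Penn Treebank tag) pairs

If the training-domain bias is at work, words like "bit", "round" and "crude" will tend to come back tagged NN.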

The tagger I use, and its supplied WSJ model, have been around for years, and it surprises me that little effort is being made (that I am aware of) to improve the model. I regularly see this tagger being used on large corpora of English, and while the training model is not mentioned, I would assume it is the supplied WSJ model; I doubt whether extensive manual post-editing follows.

Comments anyone? Perhaps there are taggers which have a more "domain-independent" model of English than that provided by WSJ? Is there a survey of this aspect?

Ciarán Ó Duibhín.

---

Date: Mon, 2 Feb 2015 04:15:19 +0000
From: Adam Kilgarriff <adam.kilgarriff at sketchengine.co.uk>
Subject: Re: [Corpora-List] tagging+lemmatizing various languages
To: Ciarán Ó Duibhín <ciaran at oduibhin.freeserve.co.uk>
Cc: "corpora at uib.no" <corpora at uib.no>


> surely the adequacy of the training data must be as important a factor as the algorithm.

Absolutely! Well said, Ciarán. I often think that if NLP people put as much effort into the training data as into the algorithms, our systems would perform much better.

There's so much more I could say on the subject (on the problem of language-specific work in Computer Science, on what's open and what's proprietary, on generalising across text types, on statistical vs rule-based systems and the in-built biases in favour of statistical ones when we compare performance figures) - but the post would quickly get much too long!

Adam

---

Date: Mon, 2 Feb 2015 08:07:07 +0000
From: Andrew Caines <cainesap at gmail.com>
Subject: Re: [Corpora-List] tagging+lemmatizing various languages
To: Adam Kilgarriff <adam.kilgarriff at sketchengine.co.uk>
Cc: "corpora at uib.no" <corpora at uib.no>

Some examples of previous attempts to adapt taggers to specific domains: biomedical (Rimell & Clark 2009 <http://dl.acm.org/citation.cfm?id=1628462>), Twitter (Plank, Hovy, McDonald & Søgaard 2014 <http://www.aclweb.org/anthology/C/C14/C14-1168.pdf>), learner language (Cahill, Gyawali & Bruno 2014 <http://www.aclweb.org/anthology/W/W14/W14-6106.pdf>). A wider survey would be useful here: contributions from the list?

As for new training data, the arXiv paper by Chelba et al. <http://arxiv.org/abs/1312.3005>, 'One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling', may be of interest, along with work by the Web As Corpus SIG: http://www.aclweb.org/anthology/sigwac.html

But as Adam says, there are many more issues here, including Anglo-centrism (note the Universal Dependencies <http://universaldependencies.github.io/docs/> project), representativeness / sample bias (note Temnikova et al 2014 <http://www.lrec-conf.org/proceedings/lrec2014/pdf/675_Paper.pdf>, Hovy, Plank & Søgaard 2014 <http://www.lrec-conf.org/proceedings/lrec2014/pdf/476_Paper.pdf>), and so on.

Andrew


