[Corpora-List] tagging+lemmatizing various languages

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Sun Feb 1 18:26:58 CET 2015

Thanks to Alexandr for a very interesting survey.

Tagging programs (algorithms) for a given language are often compared, but surely the adequacy of the training data must be as important a factor as the algorithm. As far as I can see, most available models for English are trained on the Wall Street Journal, which is a rather restrictive domain. I use such a model myself when tagging "general" English text, and unsurprisingly it mistags words used in senses which would have been rare in the training domain.

Take as a simple example "The dog bit the postman." A WSJ-trained model is likely to tag "bit" as a noun. Likewise, words such as "round" or "crude" are likely to be tagged as nouns rather than as adjectives, the more common reading in general text.
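
To make the point concrete, here is a minimal sketch of how training data alone can flip a tag. It is not the tagger or model discussed in this message: it is a bare most-frequent-tag baseline with no use of context, and the two tiny "corpora" are invented purely for illustration (the names FINANCIAL_TOY and GENERAL_TOY are hypothetical). The tags follow the Penn Treebank convention (NN, VBD, DT and so on).

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Learn, for each word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(model, words, default="NN"):
    """Tag each word with its learned tag; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Invented toy data, NOT drawn from any real corpus.
# In the "financial" sample, "bit" only occurs as a noun ("a bit").
FINANCIAL_TOY = [
    [("the", "DT"), ("market", "NN"), ("fell", "VBD"), ("a", "DT"), ("bit", "NN")],
    [("shares", "NNS"), ("rose", "VBD"), ("a", "DT"), ("bit", "NN")],
]

# In the "general" sample, "bit" occurs more often as a past-tense verb.
GENERAL_TOY = [
    [("the", "DT"), ("cat", "NN"), ("bit", "VBD"), ("the", "DT"), ("vet", "NN")],
    [("a", "DT"), ("snake", "NN"), ("bit", "VBD"), ("him", "PRP")],
    [("wait", "VB"), ("a", "DT"), ("bit", "NN")],
]

sentence = "The dog bit the postman".split()

financial_model = train_most_frequent_tag(FINANCIAL_TOY)
general_model = train_most_frequent_tag(GENERAL_TOY)

financial_tags = tag_words(financial_model, sentence)
general_tags = tag_words(general_model, sentence)

print(financial_tags)
print(general_tags)
```

The algorithm is identical in both runs; only the training sample differs. Real taggers also use the surrounding context, which reduces this effect but does not remove it when a sense is rare or absent in the training domain.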

The tagger I use, and its supplied WSJ model, have been around for years, and it surprises me that, as far as I am aware, little effort is being made to improve the model. I regularly see this tagger being used on large corpora of English, and while the training model is not mentioned I would assume it is the supplied WSJ model, and I doubt whether extensive manual post-editing follows.

Comments anyone? Perhaps there are taggers which have a more "domain-independent" model of English than that provided by WSJ? Is there a survey of this aspect?

Ciarán Ó Duibhín.
