[Corpora-List] tagging+lemmatizing various languages

Adam Kilgarriff adam.kilgarriff at sketchengine.co.uk
Mon Feb 2 05:15:19 CET 2015



> surely the adequacy of the training data must be as important a factor as
> the algorithm.

Absolutely! Well said, Ciarán. I often think that if NLP people put as much effort into the training data as into the algorithms, our systems would perform much better.

There's so much more I could say on the subject (on the problem of language-specific work in Computer Science, on what's open and what's proprietary, on generalising across text types, on statistical vs rule-based systems and the in-built biases in favour of statistical ones when we compare performance figures), but the post would quickly get much too long!

Adam

On 1 February 2015 at 17:26, Ciarán Ó Duibhín <ciaran at oduibhin.freeserve.co.uk> wrote:


> Thanks to Alexandr for a very interesting survey.
>
> Tagging programs (algorithms) for a given language are often compared, but
> surely the adequacy of the training data must be as important a factor as
> the algorithm. As far as I can see, most available models for English are
> trained on the Wall Street Journal, which is a rather restrictive domain.
> I use such a model myself when tagging "general" English text, and
> unsurprisingly it mistags senses which would have been rare in the training
> domain.
>
> Take as a simple example "The dog bit the postman." A WSJ-trained model
> is likely to tag "bit" as a noun. Likewise, words such as "round" or
> "crude" are likely to be tagged as nouns rather than the more common (in
> general text) adjective.
>
> The tagger I use, and its supplied WSJ model, have been around for years,
> and it surprises me that little effort is being made (that I am aware of)
> to improve the model. I regularly see this tagger being used on large
> corpora of English, and while the training model is not mentioned I would
> assume it is the supplied WSJ model, and I doubt whether extensive manual
> post-editing follows.
>
> Comments anyone? Perhaps there are taggers which have a more
> "domain-independent" model of English than that provided by WSJ? Is there
> a survey of this aspect?
>
> Ciarán Ó Duibhín.

--
=============================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at sketchengine.co.uk
Director Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow University of Leeds <http://leeds.ac.uk/>
Blog <http://blog.kilgarriff.co.uk> at blog.kilgarriff.co.uk
*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk/>
and SKELL <http://skell.sketchengine.co.uk/>
=============================================


