[Corpora-List] POS-tagger maintenance and improvement

Jana Diesner janadiesner at gmx.net
Wed Feb 25 17:26:46 CET 2009


Dear Adam,

We did a systematic study on the impact of various variables (the technical decisions that one has to make when implementing a POS tagger) on POS tagging accuracy.

The report may provide more detailed information on possible error sources and the respective loss or gain in accuracy. It also addresses the difficulties of doing an error analysis with systematic rigor.

URL for the report: http://reports-archive.adm.cs.cmu.edu/anon/isr2008/CMU-ISR-08-131R.pdf

Best regards, Jana

Jana Diesner

Carnegie Mellon University

School of Computer Science

Center for Computational Analysis of Social and Organizational Systems

Web: http://www.andrew.cmu.edu/user/jdiesner/

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Adam Kilgarriff
Sent: Wednesday, February 25, 2009 6:16 AM
To: Corpora List
Cc: Sue Atkins; Valerie GRUNDY; Patrick Hanks
Subject: [Corpora-List] POS-tagger maintenance and improvement

All,

My lexicography colleagues and I use POS-tagged corpora all the time, every day, and very frequently spot systematic errors. (This is for a range of languages, but particularly English.) We would dearly like to be in a dialogue with the developers of the POS-tagger and/or the relevant language models so the tagger+model could be improved in response to our feedback. (We have been using standard models rather than training our own.) However, for the taggers and language models we use (mainly TreeTagger, also CLAWS), and also for other market leaders, all of which seem to come from universities, the developers seem to have little motivation to keep improving their taggers, since incremental improvements do not make for good research papers. So there is nowhere for our feedback to go, nor any real prospect of these taggers/models improving.

Am I too pessimistic? Are there ways of improving language models other than developing bigger and better training corpora - not an exercise we have the resources to invest in? Are there commercial taggers I should be considering (as, in the commercial world, there is motivation for incremental improvements and responding to customer feedback)?

Responses and ideas are most welcome.

Adam Kilgarriff

--
================================================
Adam Kilgarriff                   http://www.kilgarriff.co.uk
Lexical Computing Ltd             http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex  adam at lexmasterclass.com
================================================
