[Corpora-List] POS-tagger maintenance and improvement

amsler at cs.utexas.edu amsler at cs.utexas.edu
Wed Feb 25 23:42:38 CET 2009


It is worth noting that there are two tasks here. One is the development of better POS taggers; the other is the creation of correctly tagged, freely downloadable text corpora.

The development of better POS taggers lends itself to periodic competitive evaluations in the manner of SIGSEM or NIST-hosted evaluations (i.e., groups of individuals perfecting their software and periodically trying it out against specially created training and test corpora whose tagging has been done and corrected so that it can serve as a gold standard).

However, the development of correctly tagged corpora is an activity that could be performed en masse by a large community of web users, in much the manner that Wikipedia has been created by a community. In fact, what seems to suggest itself is that a version of Wikipedia (or another body of copyright-free text, such as works drawn from Gutenberg) with POS-tagging (or, even more ambitiously, additional tagging for grammatical structure and semantics) could be built and grown to serve the community that needs reliably tagged text.

Both tasks share two underlying problems: what standard tags to use, and how to resolve conflicting opinions about whether text is correctly tagged. Still, the Wikipedia and Project Gutenberg models show us how to enlist a mass of people to manually correct the tags initially supplied by automated systems.

It would be nice if an existing standard corpus could be used, such as the BNC, but I don't see that happening because of copyright issues. However, there is nothing stopping us from creating alternate correctly tagged texts based on Gutenberg works or Wikipedia articles and offering them to those sites as alternative distribution texts.

Everyone keeps asking how the semantic web is going to come into existence. Maybe this is how it starts?

R. Amsler


