[Corpora-List] POS-tagger maintenance and improvement

Brett Reynolds brett at forsyths.ca
Wed Feb 25 17:20:22 CET 2009

On 25-Feb-09, at 8:48 AM, Eric Atwell wrote:
> As others have commented, TreeTagger models for other languages are
> also derived from a PoS-tagged corpus, which suggests the only way
> to eradicate systematic errors is to "correct" the tagging in the
> training corpus

I'm a language teacher who dabbles in a variety of things including linguistics and corpora. I'm an autodidact and don't have any great expertise in any of these fields, so the following may be completely obvious to everyone here, or it may be way off the mark:

It seems to me that an inconsistent grammatical description will lead to inconsistent hand tagging, which, when used to train software, will lead to inconsistent taggers. The more rigorous our grammar is, the better our taggers will perform.

To take an English example, there was a recent paper in Language Learning that referred to "last Sunday" as an adverb in the sentence "He painted his house last Sunday." This confuses the function of the NP (modifier) with the category (NP). If this is the kind of input the software has for training, well, GIGO.

English has the benefit of an analysis like the Cambridge Grammar of the English Language, which may not be a perfectly accurate description of English but seems to me head and shoulders above any other comprehensive grammar published about English. I imagine that an English POS tagger trained on a corpus tagged with a CGEL-based tagset would immediately outperform those based on other grammars. I'm not familiar with comprehensive grammars of other languages, but I'd guess they are plagued with inconsistencies.

For all languages, formal linguists, corpus linguists, corpus builders, and software developers need to be in constant interaction. An open source project would seem a good way to facilitate this, but how do we make sure there's a payback in terms of academic credentials and publishing credit? (An interesting, tangentially related discussion is here: <http://worthwhile.typepad.com/worthwhile_canadian_initi/2009/02/economics-blogging-and-academia.html>)

Best, Brett


-----------------------
Brett Reynolds
English Language Centre
Humber College Institute of Technology and Advanced Learning
Toronto, Ontario, Canada
brett.reynolds at humber.ca
