[Corpora-List] POS-tagger maintenance and improvement

WHITELOCK, Pete pete.whitelock at oup.com
Thu Feb 26 02:04:33 CET 2009


Hi Brett,

There are obvious limitations in trying to shoehorn the behaviour of words and phrases into the straitjacket of a single atomic symbol. What should be obvious though is that for the purposes of tagging such atomic symbols should reflect the distributional characteristics of the items they label and not anything else such as their function or morphology, because distributions are what taggers are trained on and what they are intended to classify. In this regard, the labeling of "last Sunday" as an adverb seems eminently sensible, since its distribution is precisely that of a (temporal) adverb rather than that of an arbitrary noun phrase. We wouldn't want to consider "paint" as a ditransitive verb in the sentence "He painted his house last Sunday". I would expect a tagger that assigned "last Sunday" the same tag as "yesterday" to out-perform one that called it an adjective-noun sequence.
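
To make the distributional point concrete, here is a minimal sketch, not taken from any real tagger. It assumes a made-up six-sentence corpus in which "last Sunday" has been pre-joined into the single token `last_Sunday`. The helper names `context_profile` and `overlap` are hypothetical. The sketch compares the immediate left and right neighbours of each item:

```python
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only).
# "last Sunday" is pre-joined into one token so it can be compared with "yesterday".
SENTENCES = [
    "he painted his house yesterday".split(),
    "he painted his house last_Sunday".split(),
    "she sold her car yesterday".split(),
    "she sold her car last_Sunday".split(),
    "he painted his fence".split(),
    "she sold her house".split(),
]


def context_profile(target, sentences):
    """Count (previous word, next word) pairs around each occurrence of target."""
    profile = Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]  # sentence-boundary markers
        for i in range(1, len(padded) - 1):
            if padded[i] == target:
                profile[(padded[i - 1], padded[i + 1])] += 1
    return profile


def overlap(a, b):
    """Jaccard overlap between the sets of contexts seen in two profiles."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


yesterday = context_profile("yesterday", SENTENCES)
last_sunday = context_profile("last_Sunday", SENTENCES)
house = context_profile("house", SENTENCES)

print(overlap(yesterday, last_sunday))  # contexts shared with the temporal adverb
print(overlap(yesterday, house))        # contexts shared with an ordinary noun
```

In this toy data "last_Sunday" occurs in exactly the contexts where "yesterday" occurs, and in none of the contexts where "house" occurs. That is the sense in which its distribution is that of a temporal adverb. A real corpus would of course give much noisier profiles.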

Pete Whitelock
Data and Resources Development Manager
Reference Department
Academic Division
Oxford University Press

On 25-Feb-09, at 8:48 AM, Eric Atwell wrote:
> As others have commented, TreeTagger models for other languages are
> also derived from a PoS-tagged corpus, which suggests the only way to
> eradicate systematic errors is to "correct" the tagging in the
> training corpus

I'm a language teacher who dabbles in a variety of things including linguistics and corpora. I'm an autodidact and don't have any great expertise in any of these fields, so the following may be completely obvious to everyone here, or it may be way off the mark:

It seems to me that an inconsistent grammatical description will lead to inconsistent hand tagging, which, when used to train software, will lead to inconsistent taggers. The more rigorous our grammar is, the better our taggers will perform.

To take an English example, there was a recent paper in Language Learning that referred to "last Sunday" as an adverb in the sentence "He painted his house last Sunday." This confuses the function of the NP (modifier) with the category (NP). If this is the kind of input the software has for training, well, GIGO.

English has the benefit of an analysis like the Cambridge Grammar of the English Language, which may not be a perfectly accurate description of English, but seems to me head and shoulders above any other comprehensive grammar published about English. I imagine that an English POS tagger trained on CGEL-based tagsets would immediately outperform those based on other grammars. I'm not familiar with comprehensive grammars of other languages, but I'd guess they are plagued with inconsistencies.

For all languages, formal linguists, corpus linguists, corpus builders, and software developers do need to be in constant interaction. An open source project would seem a good way to facilitate this, but how do we make sure there's the payback in terms of academic credentials/publishing credit? (An interesting tangentially-related discussion is here: <http://worthwhile.typepad.com/worthwhile_canadian_initi/2009/02/economics-blogging-and-academia.html>)

Best, Brett

<http://english-jack.blogspot.com>

-----------------------
Brett Reynolds
English Language Centre
Humber College Institute of Technology and Advanced Learning
Toronto, Ontario, Canada
brett.reynolds at humber.ca
