[Corpora-List] Cost of POS tagging, again

Kevin B. Cohen kevin.cohen at gmail.com
Wed Dec 27 12:44:00 CET 2006

Hi, Marc et al.,

Christopher's points are well made. A couple of other things to think
about:

1) You seem to be envisioning doing ex nihilo manual POS annotation.
However, that will probably be neither practical nor desirable; rather,
you're likely to want to do the initial annotation automatically, and then
manually curate its output.
2) You actually may not want to directly curate the POS tagging at all.
Rather, if you're going to do further processing--say, syntactic
parsing--you might want to curate the POS tags as part of, or as a
byproduct of, the downstream curation.
3) Even if you do want to directly curate the POS tagging, you will probably
find some efficiencies to be gained from automatic means. For example, you
are more likely to need to correct a bunch of adjective/past participle
distinctions (I'm assuming here that your data is English) than you are to
need to correct a bunch of mis-tagged commas (although I have certainly seen
lots of mis-POS-tagged commas, too!). Scripting can help you out here.
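
To make the scripting idea in point 3 concrete, here is a minimal sketch, not
a finished tool. It assumes the automatic tagger's output is already available
as (token, tag) pairs and that the tags are Penn Treebank-style (JJ for
adjective, VBN for past participle); the function name, the tag set, and the
sample data are all made up for illustration. It pulls out only the tokens a
curator is most likely to need to look at.

```python
# Assumption: Penn Treebank-style tags; substitute your own tagset.
SUSPECT_TAGS = {"JJ", "VBN"}  # adjective vs. past participle


def flag_for_review(tagged_tokens, suspect_tags=SUSPECT_TAGS):
    """Return (index, token, tag) triples that a human curator should check."""
    flagged = []
    for i, (token, tag) in enumerate(tagged_tokens):
        if tag in suspect_tags:
            # Tags known to be confusable always go to the curator.
            flagged.append((i, token, tag))
        elif token == "," and tag != ",":
            # A comma carrying any tag other than "," is worth a look.
            flagged.append((i, token, tag))
    return flagged


# Hypothetical tagger output, invented for this example.
sample = [
    ("The", "DT"), ("tired", "JJ"), ("runner", "NN"), (",", "NN"),
    ("exhausted", "VBN"), (",", ","), ("stopped", "VBD"), (".", "."),
]

for index, token, tag in flag_for_review(sample):
    print(index, token, tag)
```

The curator then reviews the flagged positions rather than every token, which
is where the efficiency comes from.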

Finally, Christopher is right on with suggesting hourly, rather than
per-token, budgeting.
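
One way to arrive at an hourly figure is to time a small pilot and extrapolate.
The sketch below assumes you have measured how many tokens a curator got
through in a known number of hours; the function name and every number in it
are placeholders, not real rates.

```python
def estimate_budget(total_tokens, pilot_tokens, pilot_hours, hourly_rate):
    """Extrapolate curation hours and cost from a timed pilot sample."""
    tokens_per_hour = pilot_tokens / pilot_hours
    hours = total_tokens / tokens_per_hour
    return hours, hours * hourly_rate


# Placeholder figures only: 2,000 tokens curated in 1 hour, 20 per hour pay.
hours, cost = estimate_budget(
    total_tokens=100_000, pilot_tokens=2_000, pilot_hours=1.0, hourly_rate=20.0
)
print(f"{hours:.1f} hours, cost {cost:.2f}")
```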

Hope this is helpful,


K. B. Cohen
Biomedical Text Mining Group Lead
Center for Computational Pharmacology
303-916-2417 (cell) 303-377-9194 (home)
