[Corpora-List] Cost of part of speech tagging

Christopher Walker chwalker at ldc.upenn.edu
Tue Dec 26 17:22:00 CET 2006


Hi,


| Hello all! Does anyone have any thoughts on what the cost of annotating a
| corpus with part of speech tags is? For example, would you pay someone per
| word or per sentence and how much? Any other thoughts or information on
| corpus preparation and financial cost would be very helpful. Thanks in
| advance for any thoughts.


This depends on a number of dimensions, including:

* The type of data being tagged (e.g. news or poetry);
* The quality of the tokenization provided;
* How narrowly the annotation tool is tailored to the task;
* The size of the tagset (i.e. the number of distinct tags);
* The education/training of the annotator;
* The availability of native speakers of the language.

The pay rate can vary widely, but $8-15/hour is typical. For a news
dataset of 100K words and a tagset of around 15 distinct tags, an
annotator with the right tool can tag about 250 words/hour, which
works out to 400 hours of native speaker effort. In other words:
$3200-6000 for labor, plus overhead, data formatting, training,
supervision and quality control.
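
For anyone who wants to plug in their own numbers, here is a rough
sketch of the arithmetic above. The function name is only
illustrative, and the result covers labor alone (no overhead,
formatting, training, supervision or quality control):

```python
def hourly_labor_cost(corpus_words, words_per_hour, rate_low, rate_high):
    """Return (hours, low_cost, high_cost) for hourly-paid annotation.

    Labor only: overhead, training, supervision and QC are not included.
    """
    hours = corpus_words / words_per_hour
    return hours, hours * rate_low, hours * rate_high


# Figures from the message: 100K words, 250 words/hour, $8-15/hour.
hours, low, high = hourly_labor_cost(100_000, 250, 8, 15)
print(f"{hours:.0f} hours, ${low:,.0f}-${high:,.0f} labor")
# prints: 400 hours, $3,200-$6,000 labor
```

Throughput is the sensitive input here: halving words/hour (harder
data, a larger tagset, a poorer tool) doubles the labor cost.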

You may be able to pay by the word, but I don't have any experience
with this approach. Assuming $0.20/word (slightly less than the
standard going rate for a typical translation task), the same job
would cost $20,000 -- and would probably still generate additional
overhead, data formatting, training and quality control costs.
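
To put the two payment schemes on the same footing, the sketch below
computes the per-word total and also converts the hourly figures
($8-15/hour at 250 words/hour) into an implied per-word rate. The
function names are hypothetical, and the $0.20/word rate is only the
assumption made above, not a quoted price:

```python
def per_word_total(corpus_words, cents_per_word):
    """Total labor cost in dollars when paying a flat rate per word."""
    return corpus_words * cents_per_word / 100


def implied_cents_per_word(hourly_rate, words_per_hour):
    """Per-word cost (in cents) implied by an hourly rate and throughput."""
    return 100 * hourly_rate / words_per_hour


total = per_word_total(100_000, 20)          # assumed $0.20/word
low = implied_cents_per_word(8, 250)         # $8/hour at 250 words/hour
high = implied_cents_per_word(15, 250)       # $15/hour at 250 words/hour
print(f"per-word total: ${total:,.0f}")
print(f"hourly pay implies {low:.1f}-{high:.1f} cents/word")
```

Under these assumptions the hourly scheme works out to roughly 3-6
cents per word, so $0.20/word is several times more expensive for
the same 100K-word job.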

LDC has some (open source, web-based) tools for POS annotation, but
we are still in the process of making those publicly available. Please
let me know if you're interested and I'll try to put you in touch
with the right people.

-Christopher.

On Sat, Dec 23, 2006 at 08:39:06PM -0700, Marc Carmen wrote:

| Hello all! Does anyone have any thoughts on what the cost of annotating a
| corpus with part of speech tags is? For example, would you pay someone per
| word or per sentence and how much? Any other thoughts or information on
| corpus preparation and financial cost would be very helpful. Thanks for
| advance for any thoughts.
|
| --
| Thanks,
| Marc Carmen
| marc.carmen at gmail.com


--

---------------------------------------
Christopher R. Walker, Project Manager
Automatic Content Extraction (ACE) &
Less-Commonly Taught Languages (LCTL)
LDC Annotation Lab
chwalker at ldc.upenn.edu
215.898.0946
---------------------------------------





