[Corpora-List] Syntactic parsing performance by humans?

chris brew cbrew at acm.org
Fri May 13 19:30:42 CEST 2016


It is an unarguable fact that Google's parser gets a higher score on the metrics chosen, which are completely standard in the NLP community. What is really being measured is the percentage of correct links in a graph that connects words to words via labeled links. If, as is common, there are many words in a sentence, there will be many links too, and many opportunities for mistakes. You could get a 90% score and still have a mistake or two in nearly every sentence.
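To make that concrete, here is a back-of-the-envelope sketch in Python (the 20-word sentence and the assumption that links succeed or fail independently are mine, purely for illustration):

    # Rough arithmetic: per-link accuracy vs. whole-sentence correctness.
    # Assumes each link is right or wrong independently, which is optimistic.
    per_link_accuracy = 0.90
    sentence_length = 20        # roughly one dependency link per word

    expected_errors = sentence_length * (1 - per_link_accuracy)
    prob_fully_correct = per_link_accuracy ** sentence_length

    print(f"expected errors per sentence: {expected_errors:.1f}")     # 2.0
    print(f"chance whole parse is correct: {prob_fully_correct:.0%}")  # 12%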

Whether this quality level is OK depends entirely on what use you plan to make of the graph that has been produced.

The Penn Treebank was made many years ago, with version 2 coming out in 1995. We have learnt a lot about how to annotate corpora and evaluate parsing since then. The Web Treebank is much newer and reflects painfully learned best practices, so it should be of good quality; on the other hand, it deals with much messier language, so performance scores are lower.

The current practice of evaluating individual dependencies was introduced because of major deficiencies in the first evaluation metrics that were used. It has the major plus of being transparent and straightforward. I believe that improvements on this metric will usually translate into improvements for downstream tasks that take parses as input, which I was not so sure of with the earlier metrics. This is progress, but quite modest progress.
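For readers who have not met it, the per-dependency metric is standardly reported as unlabeled and labeled attachment score (UAS/LAS): the fraction of words whose predicted head, or head plus relation label, matches the gold annotation. A minimal sketch, with toy data of my own invention rather than output from any real evaluation tool:

    # Per-dependency evaluation: score each word on whether its predicted
    # head (UAS) and head+label (LAS) agree with the gold annotation.

    def attachment_scores(gold, predicted):
        """gold, predicted: one (head_index, label) pair per word."""
        assert len(gold) == len(predicted)
        n = len(gold)
        uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
        las = sum(g == p for g, p in zip(gold, predicted)) / n
        return uas, las

    # Toy sentence "Brew likes parsing"; heads are 0-based, -1 marks the root.
    gold      = [(1, "nsubj"), (-1, "root"), (1, "dobj")]
    predicted = [(1, "nsubj"), (-1, "root"), (1, "iobj")]  # label error on word 3

    uas, las = attachment_scores(gold, predicted)
    print(f"UAS = {uas:.2f}, LAS = {las:.2f}")   # UAS = 1.00, LAS = 0.67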

On 13 May 2016 at 12:55, Darren Cook <darren at dcook.org> wrote:


> Google have trained a neural net (part of publicizing their open-source
> TensorFlow framework?) to parse syntax, claiming it is the world's best:
>
>
> http://googleresearch.blogspot.co.uk/2016/05/announcing-syntaxnet-worlds-most.html
>
> I just wanted to quote this bit, on performance: (they've called it
> Parsey McParseface)
>
> "Parsey McParseface recovers individual dependencies between words
> with over 94% accuracy, ... While there are no explicit studies in the
> literature about human performance, we know from our in-house annotation
> projects that linguists trained for this task agree in 96-97% of the
> cases ... Sentences drawn from the web are a lot harder to analyze,
> ...[it] achieves just over 90% of parse accuracy on this dataset. "
>
> Are there really no studies of human performance?! Surely some professor
> has hinted to their PhD students that it is a nice bit of relatively
> easy linguistics research that should also get them cited a lot...
>
> (I was mainly curious what the human performance gap between Penn
> Treebank and Google WebTreebank would be; if it would be more or less
> than the 4% gap for the deep learning algorithm.)
>
> Darren
>