[Corpora-List] coreference annotation for penn treebank

Yannick Versley yannick.versley at unitn.it
Mon Feb 16 18:03:21 CET 2009

> Is anyone aware of any other large-scale coreference annotation efforts for
> the Wall Street Journal portion of the Penn TreeBank?
The ARRAU corpus combines the Vieira & Poesio data with some data that has been annotated more recently:
http://cswww.essex.ac.uk/Research/nle/arrau/arrau-corpus-lrec2008
You would have to ask Massimo Poesio or Ron Artstein about availability. I'm not sure whether there has been an official release of it (as in: distributing the thing via a website).

The OntoNotes project has annotated a portion of the PTB:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04
The format it comes in is somewhat weird. It is SGML that is *not* meant to be parsed by an XML parser, and the traces in the treebank appear as tokens, which means that you have to figure out yourself which "0" in the treebank is really a token. Even in the second release there are still obvious errors: in one case the span [Korea and Japan] is marked as coreferent with "Korea", while "[Korea] and Japan" is marked as coreferent with "those two countries". Still, it's about 10x as big as MUC-6 and should definitely be worth a look.

The only other coreference resources of that size that I know of would be the ACE corpora (which annotate only some semantic classes) and the TüBa-D/Z treebank-plus-coreference (which is in German).
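On the trace-token point for OntoNotes, a minimal sketch of one way to tell a real "0" from an empty element. It assumes that you have the bracketed Penn Treebank parse for the sentence and that the coreference token indices count every leaf, traces included; the actual OntoNotes file layout is not shown here. It relies on the Penn Treebank convention that empty elements carry the part-of-speech tag -NONE-. The function names and the example sentence are made up for illustration.

```python
import re

# A leaf in a bracketed parse looks like "(TAG word)" with no nested brackets.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves_with_trace_flag(parse):
    """Return (word, is_trace) for every leaf, in surface order.

    Assumption: empty elements (traces, null complementizers) are the
    leaves tagged -NONE-, as in the Penn Treebank bracketing convention.
    """
    return [(word, tag == "-NONE-") for tag, word in LEAF.findall(parse)]


def real_token_indices(parse):
    """Indices (counting traces as tokens) of the leaves that are real words."""
    flagged = leaves_with_trace_flag(parse)
    return [i for i, (_, is_trace) in enumerate(flagged) if not is_trace]


# Hypothetical sentence with two "0" leaves: a null complementizer and a number.
example = (
    "(S (NP-SBJ (PRP He)) (VP (VBD said) (SBAR (-NONE- 0) "
    "(S (NP-SBJ (PRP it)) (VP (VBD cost) (NP (CD 0) (NNS dollars)))))) (. .))"
)

if __name__ == "__main__":
    for index, (word, is_trace) in enumerate(leaves_with_trace_flag(example)):
        print(index, word, "trace" if is_trace else "token")
```

With a list like this, a mention span given as leaf indices can be mapped back to surface words by dropping the positions flagged as traces.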

Best, Yannick
