We are preparing a grant application to the NSF Computing Research Infrastructure program to fund the preparation of a treebank of 14.6 million words of the Open American National Corpus (Ide 2008). This treebank will be built on the basis of the English Resource Grammar (Flickinger 2000, 2011) using the Redwoods methodology (Oepen et al 2004), in which the grammar creates a parse forest and annotators select the intended tree. In particular, we will produce 1 million words of hand-verified trees and an additional 13.6 million words with automatically selected trees, with an expected exact-match parse selection accuracy of over 80% by the end of the project.
The treebank will include scripts to export selected views of the information, including:
--- A variety of POS tagsets
--- Constituent structures, again with a variety of node label sets
--- Dependency structures, again in a variety of popular formats
--- MRS semantic representations (Copestake et al 2005)
As part of our grant application, we are conducting a survey to better understand how this resource could be useful to the field.
Please take a few moments to answer the questions at this link:
Many thanks,
Emily Bender, University of Washington
Dan Flickinger, Stanford University