I just wanted to make a comment about the term "gold standard".
Inter-annotator agreement is often so low that it is misleading to call human annotation the "gold standard".
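For concreteness, inter-annotator agreement is commonly quantified with chance-corrected measures such as Cohen's kappa. Here is a minimal sketch for two annotators labeling the same items (the annotator data is made up for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (po - pe) / (1 - pe)

# Hypothetical part-of-speech labels from two human annotators:
ann1 = ["N", "V", "N", "N", "ADJ", "V", "N", "V"]
ann2 = ["N", "V", "N", "ADJ", "ADJ", "N", "N", "V"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.6
```

A kappa of 0.6 from 75% raw agreement illustrates the point: once chance agreement is discounted, "gold" human labels can look considerably less shiny.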
Anything that is called a gold standard for some purpose should be thoroughly vetted by multiple annotators -- both human and computerized.
The goal for computer annotation should be to *exceed* the quality of average human annotators. That does not imply that the computers surpass human ability. It just means that they can be more thorough and consistent in annotating documents.
And by the way, I wouldn't equate computer annotation with statistical annotation. It is true that most computer annotators are statistical, but the best quality is achieved by combining multiple *independent* methods -- for cross-checking, for better coverage of rare but significant cases, and for avoiding systematic errors caused by relying on a single paradigm.
Ideally, you should only need to call a human expert for the unusual cases where the multiple computerized tests disagree.
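That workflow can be sketched in a few lines. This is only an illustration, assuming categorical labels and toy stand-ins for the independent methods; every function name here is hypothetical:

```python
def annotate_with_escalation(item, annotators, human_review):
    """Run several independent annotators; escalate to a human expert
    only in the unusual case where the methods disagree."""
    labels = [annotate(item) for annotate in annotators]
    if len(set(labels)) == 1:            # all independent methods agree
        return labels[0], "automatic"
    return human_review(item), "human"   # disagreement -> ask an expert

# Toy stand-ins for two independent annotation methods (hypothetical):
lexicon_based = lambda w: "NOUN" if w.endswith("ness") else "VERB"
rule_based    = lambda w: "NOUN" if w in {"darkness", "run"} else "VERB"

label, source = annotate_with_escalation(
    "darkness", [lexicon_based, rule_based], human_review=lambda w: "NOUN")
print(label, source)  # -> NOUN automatic
```

The design point is the routing, not the toy annotators: human effort is reserved for exactly the cases where independent methods fail to cross-check each other.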