I'm working on a project in which we are attempting to characterize a few different corpora according to how "well-behaved" they are. That is, we want to show that some are more amenable than others to parsing and part-of-speech tagging. Some of the corpora consist of complete, grammatical sentences, while others are telegraphic, fragmentary text containing a large number of abbreviations and misspellings.
One approach I've tried is to tag and parse each of the corpora with the Stanford tagger and parser, generating ranked lists of the unique tokens and tags and looking for certain errors, warnings, and phrase structures in the parser output. For instance, I'm counting how many sentences the parser had to retry, how many it failed to find any parse for, how many it ran out of memory on, and how many FRAG (sentence fragment) phrases appear in the parser output.
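For what it's worth, here is a minimal sketch of the FRAG-counting step, assuming the parser emits bracketed Penn Treebank-style trees; the helper name `count_labels` is just something I made up for illustration:

```python
import re

def count_labels(parse_output: str, label: str) -> int:
    """Count occurrences of a phrase label (e.g. FRAG) in bracketed
    Penn Treebank-style parser output (hypothetical helper)."""
    # A label appears immediately after an open paren, followed by a
    # space (if it has children) or a close paren.
    return len(re.findall(r"\(" + re.escape(label) + r"[ )]", parse_output))

sample = "(ROOT (FRAG (NP (NN example)) (. .)))"
print(count_labels(sample, "FRAG"))  # prints 1
```

Dividing such counts by the number of sentences gives a per-corpus rate that can be compared across corpora.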
Are there standard or widely accepted metrics for describing the "well-behavedness" of corpora?
Many thanks, Sean Igo