> > Are there standard or widely accepted metrics for describing the
> > well-behavedness of corpora?
>
> The answer is, I think, a resounding 'no'. There is disappointingly little
> work on systematically comparing corpora, or making objective general
> observations of one corpus in comparison to others. (Citations proving me
> wrong are most welcome. I'm aware of Sekine, Roland and Jurafsky,
> Cavaglia, also work on genre by eg Karlgren, Santini, Sharoff, which touches
> on the topic)
>
On general observations of one corpus in comparison to others: there is a recent article (in French) on how the performance of NLP tools varies when they are applied to corpora of different genres and domains:
Marie-Paule Jacques and Nathalie Aussenac-Gilles (2006). "Variabilité des performances des outils de TAL et genre textuel. Cas des patrons lexico-syntaxiques". TAL, vol. 47, n° 1/2006, pp. 11-32.
Cheers, Marina