[Corpora-List] Metrics for corpus "parseability"

Marina Santini marinamailinglists at gmail.com
Sun Feb 3 11:40:45 CET 2008

Hi Sean,

> > Are there standard or widely accepted metrics for describing the
> > well-behavedness of corpora?
> The answer is, I think, a resounding 'no'. There is disappointingly little
> work on systematically comparing corpora, or making objective general
> observations of one corpus in comparison to others. (Citations proving me
> wrong are most welcome. I'm aware of Sekine, Roland and Jurafsky,
> Cavaglia, also work on genre by eg Karlgren, Santini, Sharoff, which touches
> on the topic)

On general observations of one corpus in comparison to others: there is a recent article (in French) on how the performance of NLP tools varies when they are applied to corpora of different genres and domains:

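Marie-Paule Jacques and Nathalie Aussenac-Gilles (2006). "Variabilité des performances des outils de TAL et genre textuel. Cas des patrons lexico-syntaxiques" [Variability in the performance of NLP tools and text genre: the case of lexico-syntactic patterns]. TAL. Volume 47 – n° 1/2006, pp. 11-32.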

Cheers, Marina
