[Corpora-List] Metrics for corpus "parseability"

Marina Santini marinamailinglists at gmail.com
Sun Feb 3 11:40:45 CET 2008


Hi Sean,


> > Are there standard or widely accepted metrics for describing the
> > well-behavedness of corpora?
>
> The answer is, I think, a resounding 'no'. There is disappointingly little
> work on systematically comparing corpora, or making objective general
> observations of one corpus in comparison to others. (Citations proving me
> wrong are most welcome. I'm aware of Sekine, Roland and Jurafsky,
> Cavaglia, also work on genre by eg Karlgren, Santini, Sharoff, which touches
> on the topic)
>

On general observations of one corpus in comparison to others: there is a recent article (in French) on how the performance of NLP tools varies when they are applied to corpora of different genres and domains:

Marie-Paule Jacques and Nathalie Aussenac-Gilles (2006). "Variabilité des performances des outils de TAL et genre textuel. Cas des patrons lexico-syntaxiques". TAL. Volume 47 – n° 1/2006, pp. 11-32

Cheers, Marina


