[Corpora-List] Metrics for corpus "parseability"

Steve Finch s.finch at daxtra.com
Tue Feb 5 17:27:39 CET 2008


I fail to see precisely what the 3-SAT paper you reference has to do with the issue under discussion. It seems to concern how the difficulty of solving (or of there being a solution to) certain mathematical equations changes as certain statistics of the form of the equation change. Maybe you intend to draw an analogy with language-form statistics such as reading age stats?

For example, there are various well-known "reading age" statistics that count the average length of sentences and the length of the words used (or correlated stats such as number of syllables). One is the Flesch readability score, another the Bormuth Grade Level (others are evident from a quick perusal of Google). They comprise somewhat arbitrary-looking formulae such as 0.134*ASL + 5.2*ASW - 2.134 (example made up, but ASL is Average Sentence Length, and ASW is Average Syllables per Word). Other statistics include proportion of passive verb forms, proportion of "easy" words, and various other forms of linguo-statistical JuJu. Most of them have the property that if you randomly rearrange the words in each sentence the statistic is invariant. Is this the sort of statistic you mean? I think these statistics are useful as a rule of thumb if you can *assume* the input is well-formed and generated by a human being who is not trying to fool the system.
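To make the invariance point concrete, here is a minimal sketch in Python. It uses the made-up coefficients from the formula above, not those of any published readability score, and the function names and the vowel-run syllable counter are my own assumptions for illustration only.

```python
import random
import re


def count_syllables(word):
    """Crude assumption: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def reading_score(sentences, a=0.134, b=5.2, c=-2.134):
    """Compute a*ASL + b*ASW + c over a list of tokenised sentences.

    The default coefficients are the made-up ones from the text,
    not those of any real readability formula.
    """
    words = [w for sentence in sentences for w in sentence]
    asl = len(words) / len(sentences)                        # Average Sentence Length
    asw = sum(count_syllables(w) for w in words) / len(words)  # Average Syllables per Word
    return a * asl + b * asw + c


def shuffle_within_sentences(sentences, seed=0):
    """Return a copy in which the words of each sentence are randomly reordered."""
    rng = random.Random(seed)
    shuffled = []
    for sentence in sentences:
        copy = list(sentence)
        rng.shuffle(copy)
        shuffled.append(copy)
    return shuffled
```

Because the score depends only on the number of sentences, the number of words, and per-word syllable counts, reading_score(s) and reading_score(shuffle_within_sentences(s)) agree for any input s, even though the shuffled text is word salad.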

I think that the consensus position on reading age stats is that the following are probably true on average:

(1) Shorter sentences are easier to process for both humans and computers.

(2) Sentences of the same length containing shorter words are easier to process.

(3) Sentences of the same length containing more closed class words are likely to be easier to process. (NB - high correlation with (2))

(4) Sentences of the same length exhibiting certain syntactic phenomena such as the passive form (maybe evidenced or approximated by the presence of certain parts of speech) are likely to be harder to process.

(5) Sentences of the same length containing more common words are likely to be easier to process (NB correlation w/ (2) and (3)).

I think that (2), (3) and (5) are correlated and their relative contributions need to be teased apart.

Now all of these reading age statistics are at best "rule of thumb" estimates, but they are used by some publishers, and hence are likely to have some empirical basis (although I have not seen the science). What relation they may have to parsing algorithms is unclear, and there are clearly cases where they can be fooled by the mischievous. However, it might be interesting to investigate such statistics to see to what extent they correlate with algorithmic measures of the grammatical coverage and/or accuracy of a given parser, for example. Such statistics are certainly something to control for in any attempt to devise a better-founded statistic for "parseability".
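As a rough sketch of how such an investigation might start, one could compute a plain Pearson correlation between a per-sentence statistic and a per-sentence parser score. The function name and the choice of Pearson are assumptions made for illustration; nothing here is tied to a particular parser or to any real measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers.

    xs might hold a surface statistic per sentence (e.g. length in words),
    ys a per-sentence parser accuracy; both are supplied by the caller.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A strong correlation between, say, sentence length and parser accuracy would mean that length has to be factored out before crediting any new "parseability" statistic with explanatory power.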

- Steve.

On Tuesday 05 February 2008 08:28, Miles Osborne wrote:
> Actually, I think you have misunderstood what I said: this truly is about
> the data and not about "algorithms". What I said was that you need to be
> able to understand about the hardness of the sentences themselves, without
> reference to the parser etc. Read that sample paper and you will know what
> I mean.
> Miles
> On 05/02/2008, Adam Kilgarriff <adam at lexmasterclass.com> wrote:
> > On 04/02/2008, Miles Osborne <miles at inf.ed.ac.uk> wrote:
> > > I must confess, the idea that a corpus can be described in terms of
> > > "parseability" sounds a little ill-founded to me. The choice of
> > > particular parsing algorithm may dictate which examples are hard to
> > > process, as will the underlying grammar etc etc.
> >
> > I couldn't disagree more. It's the equivalent of saying that it's
> > ill-founded to evaluate parsers because they will always perform
> > differently on different corpora. It just goes to show that you're
> > interested in algorithms not data. The field is way imbalanced by people
> > who think more about algorithms than the corpora they apply them to.
> >
> > Adam
> >
> >
> > --
> >
> > > ================================================
> > > Adam Kilgarriff
> > > http://www.kilgarriff.co.uk
> > > Lexical Computing Ltd http://www.sketchengine.co.uk
> > > Lexicography MasterClass Ltd http://www.lexmasterclass.com
> > > Universities of Leeds and Sussex adam at lexmasterclass.com
> > > ================================================

-- Steven Finch Daxtra Technologies Tel: +44 (0)131 653 1250 Email: s.finch at daxtra.com
