[Corpora-List] Training Corpus for Readability Difficulty

Albretch Mueller lbrtchx at gmail.com
Sat Oct 18 18:42:17 CEST 2008

> We are looking for a training corpus to study readability difficulty.
> ... Unfortunately he was unable to share it the last time I asked

This is what I would do: ~

1) search educational and children's web sites, or ask "harry potter" himself ;-), to find out what kinds of books children (of a certain culture and age) read. You should finish this step with a long list, and be ready to be oddly amazed by it; today's children's universe has changed quite a bit (what is one of the best-selling video games in America? One whose theme is gunning down poor, powerless (and black) Haitian people in Miami ...), then ~

2) go to http://www.gutenberg.org and search for "children" (I got 131 hits, all of them books in the public domain); you may find some of the books on your list, or similar ones, and I am sure you could find a whole lot more ~

3) try to define "readability" in a more functional and perhaps measurable way. Offhand I can think of a number of features you can easily get at syntactically (with some not-that-complicated code): ~

3.1) length of texts (as number of words that are and/or are not content words) ~

3.2) length of texts' sentences and/or paragraphs ~

3.3) dependency and "carried-over" sense among paragraphs ~

You (or the code monkey you hire) should try to stratify this information and define some metrics. Without defining "readability" first, building the type of corpus you have in mind would be an aimless project ~
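
As a minimal sketch of features 3.1 to 3.3 (not a definitive readability metric): the Python below counts words and content words, averages sentence and paragraph lengths, and uses the content-word overlap between adjacent paragraphs as a crude stand-in for "carried-over" sense. The tiny stop-word list, the regex-based sentence splitter, the Jaccard overlap and all function names are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption);
# a real study would use a fuller, language-specific one.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on",
    "at", "is", "was", "it", "he", "she", "they",
}

def paragraphs(text):
    """Split on blank lines; drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def words(text):
    """Lower-cased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def content_words(text):
    """Words that are not in the function-word list."""
    return [w for w in words(text) if w not in FUNCTION_WORDS]

def readability_features(text):
    paras = paragraphs(text)
    sents = [s for p in paras for s in sentences(p)]
    all_words = words(text)

    # 3.3 (assumed proxy): Jaccard overlap of content words
    # between each paragraph and the one that follows it.
    overlaps = []
    for prev, cur in zip(paras, paras[1:]):
        a, b = set(content_words(prev)), set(content_words(cur))
        union = a | b
        overlaps.append(len(a & b) / len(union) if union else 0.0)

    return {
        # 3.1: text length, with and without function words
        "n_words": len(all_words),
        "n_content_words": len(content_words(text)),
        # 3.2: sentence and paragraph length
        "words_per_sentence": len(all_words) / len(sents) if sents else 0.0,
        "sentences_per_paragraph": len(sents) / len(paras) if paras else 0.0,
        # 3.3: mean overlap between adjacent paragraphs
        "mean_paragraph_overlap": sum(overlaps) / len(overlaps) if overlaps else 0.0,
    }

sample = "The cat sat on the mat. The cat was happy.\n\nThe dog saw the cat. It ran."
features = readability_features(sample)
for name, value in features.items():
    print(name, value)
```

Running this over each book on your list gives one row of numbers per text, which you can then stratify by the age group the book is aimed at.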

I have thought about these same kinds of things, but more in an "X as a second language" way: say, if you speak L1, there are certain syntactic structures and false cognates in L2 you want to be exposed to. Both "syntactic structures" and "false cognates" can be measurably accounted for and parametrized ~ <blatantly_off_topic_ad>

I have theorized about and coded such things already, and I would work for food ;-) provided it is an open source project. Also, I speak English, German and Spanish (and am willing to learn any language, especially a non-Western one) </blatantly_off_topic_ad> ~

See you

