[Corpora-List] Training Corpus for Readability Difficulty

Scott A. Crossley sacrossley at gmail.com
Sat Oct 18 18:55:15 CEST 2008

Some of Bormuth's passages along with their readability scores are available for free in ERIC documents (at least 32 of the passages).

Here is a brief overview of the passages:

Bormuth's (1971) corpus of 32 academic reading texts is drawn from school instructional material and includes passages from biology, chemistry, civics, current affairs, economics, geography, history, literature, mathematics, and physics. The mean length of the texts was 269.28 words (SD = 16.27) and the mean number of sentences per hundred words was 7.10 (SD = 2.81).

The problem is the small number of passages, which constrains the number of variables you can statistically analyze without overfitting the model.
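To make the constraint concrete, here is a rough sketch (not anything from Bormuth's reports) of how the standard adjusted R^2 formula penalizes extra predictors when there are only 32 passages. The raw R^2 of 0.80 is a made-up value used purely for illustration, and adjusted R^2 is only one crude indicator; cross-validation would be a more direct check for overfitting.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a linear model with n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


N_PASSAGES = 32  # number of passages in the corpus described above
RAW_R2 = 0.80    # hypothetical in-sample fit, chosen only for illustration

# Same raw fit, increasing number of predictors: the adjusted value shrinks.
for p in (1, 3, 5, 10, 20):
    print(f"{p:2d} predictors: adjusted R^2 = {adjusted_r2(RAW_R2, N_PASSAGES, p):.3f}")
```

With so few passages, each additional variable eats a noticeable share of the available degrees of freedom, which is why the variable set has to stay small.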

Here are the references:

Bormuth, J. R. (1969). Development of readability analyses (Final Report, Project No. 7-0052, Contract No. 1, OEC-3-7-070052-0326). Washington, DC: U. S. Office of Education.

Bormuth, J. R. (1971). Development of standards of readability: Toward a rational criterion of passage performance. U. S. Department of Health, Education and Welfare (ERIC Doc. No. ED 054 233).

Let me know if that helps.

Scott Crossley, Ph.D.
Linguistics/TESOL

Department of English
Mississippi State University
http://www.msstate.edu/dept/english/tesol/tesolfaculty.html
(662) 325-2355

Institute for Intelligent Systems
University of Memphis
http://mnemosyne.csl.psyc.memphis.edu/iis/

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Albretch Mueller
Sent: Saturday, October 18, 2008 11:42 AM
To: corpora at uib.no
Subject: [Corpora-List] Training Corpus for Readability Difficulty

> We are looking for a training corpus to study readability difficulty.
> ... Unfortunately he was unable to share it the last time I asked

this is what I would do: ~

1) search for educational and children's web sites or ask "harry potter" himself ;-) what kinds of books children (of a certain culture and age) read. You should finish this step with a long list, and be ready to be amazed by it; today's children's universe has changed quite a bit (what is one of the best-selling video games in America? One whose theme is gunning down poor, powerless (and black) Haitian people in Miami ...), then ~

2) go to http://www.gutenberg.org and search for "children" (I got 131 hits, all of them books in the public domain); you may find the same or similar ones there, and I am sure you could find a whole lot more ~

3) try to define "readability" in a more functional and perhaps measurable way. I can quickly think of a number of features you can easily get at syntactically (with some not-that-complicated code): ~

3.1) length of texts (as number of words that are and/or are not content words) ~

3.2) length of texts' sentences and/or paragraphs ~

3.3) dependency and "carried-over" sense among paragraphs ~

You/the code monkey you hire should try to stratify this information and define some metrics. Without defining "readability" first, building the type of corpus you have in mind would be an aimless project ~
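As a minimal sketch of how features 3.1 and 3.2 might be extracted, assuming plain English text, naive regex-based sentence splitting, and a deliberately tiny function-word list (all of these are illustrative assumptions, and the function name is hypothetical). Feature 3.3 needs real discourse analysis and is not attempted here.

```python
import re

# Tiny illustrative function-word list (an assumption; a real project
# would use a much fuller list or a part-of-speech tagger).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "it", "that", "this", "with", "for", "as", "by",
}


def surface_features(text: str) -> dict:
    """Compute simple length-based features for one text."""
    # Paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    content_words = [w for w in words if w not in FUNCTION_WORDS]
    n_words = len(words)
    return {
        "words": n_words,
        "content_words": len(content_words),
        "function_words": n_words - len(content_words),
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "sentences_per_100_words": 100.0 * len(sentences) / n_words if n_words else 0.0,
    }


sample = "The cat sat on the mat. It was happy.\n\nDogs bark."
print(surface_features(sample))
```

Running this over every book in a candidate list would give one feature row per text, which can then be stratified by intended age group.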

I have thought about these same kinds of things, but more in an "X as a second language" way: say, if you speak L1, there are certain syntactic structures and false cognates in L2 you want to be exposed to. Both "syntactic structures" and "false cognates" can be measurably accounted for and parametrized ~ <blatantly_off_topic_ad>

I have theorized about and coded such things already, and I would work for food ;-) provided it is an open source project. Also, I speak English, German and Spanish (and am willing to learn any language, especially a non-Western one) </blatantly_off_topic_ad> ~

See you


_______________________________________________ Corpora mailing list Corpora at uib.no http://mailman.uib.no/listinfo/corpora
