[Corpora-List] A question about placement of notes in linguistically annotated corpora of Early Modern texts

Martin Mueller martin.mueller
Sun Jan 20 17:31:02 CET 2013

Phil Burns from Northwestern's IT group and I are working on a project to provide linguistic annotation for some 40,000 texts published between 1473 and 1700 and transcribed by the EEBO-TCP project. Currently, all these texts are available only to the members of institutions that have subscribed to them. But in 2015, some 25,000 texts will pass into the public domain, and over the following five years another 45,000 texts will follow them. Thus students of Early Modern English can look forward to an environment that will soon provide them with access anywhere anytime to a rich set of carefully encoded data from the first 250 years of English print culture.

A much smaller set of ~2,000 18th-century texts from the ECCO-TCP project has already been released into the public domain, and we expect to provide linguistically annotated versions of these texts at some point in the spring or early summer.

If potential users of these data sets have advice to offer, we would very much like to hear it, and I would like to seek your advice on a particular question. First a few remarks about the encoding of these texts. They were encoded in a modified version of TEI P3 that will be transformed to TEI P5 in the course of our work. The encoding is light but consistent and allows you to exclude or focus on words that occur in paragraphs, lines of verse, epigraphs, notes, lists and tables, speaker labels, opening and closing phrases of correspondence, and a few others. The linguistic annotation will be "element-aware" in the sense that different rules, probability tables, and supporting lexica will be used for stuff that is likely to be special, such as lines of verse, stage directions, or notes.

My particular question has to do with the encoding of notes, stuff put inside <note> elements. Early modern prose is full of notes. In the print originals they occur sometimes at the foot of the page, but the great majority of them are marginal notes (and they often are summaries rather than notes in a modern sense of the word). In the TCP transcriptions, footnotes and marginal notes are encoded inline. Footnotes are placed where their markers occur. Marginal notes are put where they fit best, following broad rules but leaving discretion to the transcribers. Here is a typical example from A Defence of the Catholyke Cause (1602):

<P>IT is now more then three yeres, gentle reader, since that one Edward Squyre,<NOTE PLACE="marg">Edvvard Squyre executed for a fayned conspiracy, and the author of this treatyse charge therevvith.</NOTE> hauing bin sometyme prisoner in Spayne, and escaping thence into England, was condemned and executed for a fayned conspiracy against her Maiestyes person, wherto my self &amp; some others were charged to be priuy; &amp; for as much as it seemed to mee that this fraudulent manner of our aduersaries proceeding against Catholykes, by way of slanders and diffamations, authorised with shew of publik Iustice,<NOTE PLACE="marg">The reasons that moued the author to vvryte an Apology in his ovvne defence.</NOTE> and continued now many yeres, did beginne to redound not only to the vndeserued disgrace, &amp; discredit of particular men wrongfully accused, but also to the dishonour of our whole cause, I thought it co~uenie~t to write an Apology in my defe~ce, &amp; to dedicate the same to the Lords of her Maiesties priuy counsel, as wel to cleare my self to their honours of the cryme falsly imputed vnto mee, as also to discouer vnto them the treacherous dealing of such as abuse her Maiesties autority and theirs in this behalf, to the spilling of much innocent blood, with no smalle blemish to her Maiesties gouernment, and the assured exposition of the whole state, to the wrath of God, if it be not remedied in tyme.</P>
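
In a transcription like this, anyone who wants the main text in continuous reading order has to step over each inline note while keeping the words that follow it. The following is a minimal Python sketch of that separation, using only the standard library. The function name split_streams is hypothetical, the sample is an abbreviated version of the passage above, and nothing here is meant to describe how any existing tool does it.

```python
import xml.etree.ElementTree as ET


def split_streams(element, note_tag="NOTE"):
    """Return (main_text, notes): main text in reading order, note texts apart."""
    main, notes = [], []

    def walk(node):
        if node.tag == note_tag:
            # The whole note is set aside, including any markup inside it.
            notes.append("".join(node.itertext()))
            return
        main.append(node.text or "")
        for child in node:
            walk(child)
            # Text after a child (its "tail") belongs to the enclosing element.
            main.append(child.tail or "")

    walk(element)
    return "".join(main), notes


# Abbreviated from the passage above, for illustration only.
SAMPLE_P = (
    '<P>one Edward Squyre,<NOTE PLACE="marg">Edvvard Squyre executed for a '
    'fayned conspiracy.</NOTE> hauing bin sometyme prisoner in Spayne</P>'
)

main_text, note_texts = split_streams(ET.fromstring(SAMPLE_P))
```

Here main_text holds the sentence without the interruption, and note_texts holds the marginal note on its own.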

MorphAdorner, Phil Burns' software, treats such <note> elements as "jump tags", treats their content separately, and "knows" about the reading order of the main text. We have two choices for dealing with <note> elements. We could leave them where they are, or we could gather them in separate <div> elements, leaving some form of marker at the original location of their encoding. That procedure would be reversible, and it could also be separately implemented by anybody manipulating the texts. So in some ways the question does not matter very much.
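
To make the second choice concrete, here is a minimal Python sketch of gathering notes into a separate division and putting them back again. Everything about the markup it produces is an assumption made for illustration: the PTR marker element, the generated ID and TARGET values, and the TYPE="notes" division are hypothetical and are not TCP or MorphAdorner conventions. It also assumes the notes carry no ID attribute of their own.

```python
import xml.etree.ElementTree as ET

NOTES_TYPE = "notes"  # hypothetical TYPE value for the gathered-notes division


def gather_notes(root):
    """Move every NOTE into a trailing DIV, leaving a PTR marker in its place."""
    # Collect (parent, note) pairs first so the tree is not changed mid-iteration.
    pairs = [(parent, child) for parent in root.iter()
             for child in parent if child.tag == "NOTE"]
    notes_div = ET.SubElement(root, "DIV", {"TYPE": NOTES_TYPE})
    for number, (parent, note) in enumerate(pairs, start=1):
        note_id = f"n{number}"
        marker = ET.Element("PTR", {"TARGET": note_id})
        # Text following the note is main text: it stays behind with the marker.
        marker.tail, note.tail = note.tail, None
        note.set("ID", note_id)
        parent[list(parent).index(note)] = marker
        notes_div.append(note)
    return root


def restore_notes(root):
    """Undo gather_notes: put each NOTE back where its marker stands."""
    notes_div = root.find(f"DIV[@TYPE='{NOTES_TYPE}']")
    by_id = {note.get("ID"): note for note in notes_div}
    pairs = [(parent, child) for parent in root.iter()
             for child in parent
             if child.tag == "PTR" and child.get("TARGET") in by_id]
    for parent, marker in pairs:
        note = by_id[marker.get("TARGET")]
        note.tail = marker.tail
        del note.attrib["ID"]
        parent[list(parent).index(marker)] = note
    root.remove(notes_div)
    return root


# Abbreviated sample in the style of the passage above, for illustration only.
SAMPLE_BODY = (
    '<BODY><P>one Edward Squyre,<NOTE PLACE="marg">Edvvard Squyre executed.'
    '</NOTE> hauing bin prisoner,<NOTE PLACE="marg">The reasons.</NOTE>'
    ' and continued.</P></BODY>'
)
```

Calling gather_notes(ET.fromstring(SAMPLE_BODY)) leaves the paragraph with two empty markers and moves both notes into the new division. Calling restore_notes on the result gives back the inline form.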

But from the OWL perspective (Piotr Banski's lovely term for "ordinary working linguist"), which choice would provide the better default setting and be more in keeping with practices elsewhere and the expectations of scholars who may work with those texts? Notice that this question has nothing to do with the way in which notes would be displayed in a browser-based rendering of the texts. It is a question about which choice would on balance provide an easier or more profitable working environment.

My own view so far has been that there would be some advantages in grouping notes separately. It would make it a little easier to attend to notes as a genre in their own right, it would make it a little easier to process the main text because you wouldn't have to worry about stuff that interrupts the reading order, and from a philological perspective you could argue that wherever the notes were placed in the original, they certainly were not placed in the middle of the text. But I'm not very confident about my hunches in this regard, and if there is a consensus "out there" about best practices I would much rather follow that than my own nose.

I would welcome your advice, online or offline, on this topic as well as any information about the practices of comparable enterprises elsewhere.

With thanks in advance

Martin Mueller Professor emeritus Department of English and Classics Northwestern University
