[Corpora-List] announcement: penn-helsinki parsed corpus of early modern english

Beatrice Santorini beatrice at babel.ling.upenn.edu
Sun Mar 6 00:30:00 CET 2005

We are happy to announce the release of the Penn-Helsinki Parsed Corpus
of Early Modern English (PPCEME). The construction of the corpus was
funded by the National Endowment for the Humanities (Grant # PA
23382-99) and the National Science Foundation (Grant # BCS 99-05488).
The Principal Investigator on these grants was Anthony Kroch, Professor
of Linguistics, University of Pennsylvania, and the research associate
primarily responsible for corpus construction was Dr. Beatrice
Santorini.

The PPCEME contains 1.8 million words of running text, annotated for
part of speech and sentence structure. It includes a parsed version of
the entire Early Modern English section of the Helsinki Corpus of
Historical English (600,000 words) and two equal-sized extensions of
the Helsinki samples. Where the Helsinki texts were not sufficiently
large to permit such extensions, new texts of similar genre and date
were substituted, thereby preserving the sociolinguistic
characteristics of the Helsinki corpus to the greatest extent possible.

The new corpus will be distributed along with the existing PPCME2, the
Penn-Helsinki Parsed Corpus of Middle English, which has been somewhat
updated for the new release, under the same conditions of use. The two
corpora share the same annotation system, and the release CD contains a
new version of the annotation manual, which has been revised to explain
the annotation system more fully and now contains an extensive index.
The new manual also explains the small number of differences in the
annotation schemes of the PPCEME and the PPCME2. Information on
obtaining the release CD is available at:


The search program CorpusSearch that accompanies our corpora has been
entirely redesigned and reprogrammed by its author, Beth Randall. The
new version of the program, CorpusSearch 2, has been released as open
source software on the Sourceforge web site. It is included on the
release CD, and the latest version of the program will always be
downloadable from Sourceforge at the URL:


This web site also contains the Users Guide and a facility for
reporting bugs, as well as the program's source code.

The PPCME2 and PPCEME, along with CorpusSearch 2, will be distributed
only as a single CD, at a cost of US$300. However,
anyone with a license for the PPCME2 can purchase a license for the new
corpus at a cost of US$50. The update will include the new version of
the PPCME2 and all of the other updates described above.

The Penn historical corpora are part of a larger project to produce
parsed corpora of historical English. The other participants, Anthony
Warner, Susan Pintzuk, and Ann Taylor at the University of York, have
released the York-Toronto-Helsinki Parsed Corpus of Old English Prose.
Please see their web site for details:


Additional corpora currently under construction at Penn and York are:

The Penn Parsed Corpus of Modern British English
The York-Helsinki Parsed Corpus of Early English Correspondence
