[Corpora-List] corpus of plain text docs in English

Mark Davies Mark_Davies at byu.edu
Mon Apr 4 16:43:17 CEST 2011


I'm not sure how far back you want the texts to go. If it's just to the early 1800s or so, you might check the links at the 400-million-word Corpus of Historical American English (http://corpus.byu.edu/coha), under Help / Composition of Corpus. That page suggests some nice text archives, such as Project Gutenberg and Making of America.

For anything farther back than the early 1800s, you could just use the older texts from Project Gutenberg, or the many online archives of Early Modern English authors. If your library is a member, you'll also want to check the huge collection at Early English Books Online (EEBO) for the machine-readable texts (as opposed to the PDF page images).


Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> petar at lml.bas.bg
> Sent: Friday, April 01, 2011 1:13 AM
> To: Corpora at uib.no
> Subject: [Corpora-List] corpus of plain text docs in English
> Dear Corpora members,
> I am working on a domain-specific machine translation project. I am looking for a
> corpus of plain text (historical) documents in English. I would like to test
> whether a standard n-gram model, trained on such documents, could be used to improve
> other machine translation techniques designed specifically for historical documents.
> Could you recommend some corpora?
> Thank you.
> Best regards,
> Petar Mitankin
