Ancestry.com, Inc. and the Department of Computer Science at Brigham Young University are working together to create a publicly available, hand-annotated collection of images and OCR transcriptions of scanned documents. We would like your feedback before we get too far along. We hope to have this complete early next year.
This corpus is intended to support the evaluation of document image analysis, OCR error correction, and information extraction algorithms, as well as related research. The images come from the scanned books and newspapers in a large collection at Ancestry.com and include the following kinds of documents:
. newspapers (typical newspapers from the 20th century)
. city directories (like old phone books)
. college yearbooks (includes photos, names and majors)
. navy cruise books (like a yearbook for those who served on large US Navy ships)
. birth records books (recording birth events and family relationships among parents and children)
. local histories (histories of small geographical areas)
. family histories (multi-generational histories of particular families)
. church yearbooks (describes the organization and events of a local church congregation)
A few example images can be found here:
The particular selection of documents was motivated by genealogy and family history research, but we believe the final corpus with annotations (including manual transcriptions) will be of value outside this field of research.
If you are likely to use this corpus in your research, we would like your help in refining our priorities for it. Please reply to this email with any ideas you may have, and help us prioritize an existing wish-list of potential features by voting at this website:
This web page will iteratively present you with random pairs of features. Each feature is described briefly by a short sentence. For each pair of features, please click on the one that would be the most useful to you. After each click, you will be presented with the next pair. Feel free to vote in multiple sessions, on multiple days. You can continue voting for as long as you would like. The longer you vote, the better our ranking will be.
You can also add your own ideas to this wish-list. Please add them early so other people have a chance to vote for them, but only after you familiarize yourself with the features already listed. A complete list can be found by clicking on "View Results". We would also like to hear any feedback you may have when you reply to this email.
If you know of someone who might be interested in this corpus, please forward this email to them.
Thank you for your time.
Thomas L. Packer
Department of Computer Science
Brigham Young University