Hi Michael, I spent much of 1991-2011 creating and documenting corpora...

I would suggest three overarching factors: i) the purpose of the corpus; ii) the time available to do the description (do as much as you can initially and, if you can, keep the description open for subsequent additions/amendments); iii) different people will want different *levels* of detail in the description, so perhaps start with a summary and add links to more detailed information.

Here are a few writings and links that may be of use to you....

#1 http://www.ilc.cnr.it/EAGLES/corpustyp/corpustyp.html

#2 Sue Atkins, Jeremy Clear, Nicholas Ostler. Corpus Design Criteria. 15th January 1991. www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf

#3 http://ota.ox.ac.uk/documents/creating/dlc/

#4 http://acorn.aston.ac.uk/acorn_publication.html 2. More information about the texts in the ACORN corpora: a) English corpora b) French corpora c) German corpora d) Spanish corpora

#5 http://acorn.aston.ac.uk/acorn_publication.html (2002) The Bank of English past, present, and future: corpus size, composition, annotation, and software (unpublished; presented at The 2nd ILASH Half-Day Workshop on “Computational Language Resources”, University of Sheffield, Feb 8th 2002)

#6 Ramesh Krishnamurthy, Iztok Kosem. Issues in creating a corpus for EAP pedagogy and research. Journal of English for Academic Purposes, Vol 6, Iss 4, Pages 356-373. www.sciencedirect.com/science/journal/14751585/6

#7 http://catalog.elra.info/product_info.php?products_id=627

#8 many of my publications and unpublished writings are available at https://aston.academia.edu/RameshKrishnamurthy

Best wishes,
Ramesh Krishnamurthy
Visiting Academic Fellow
Aston University

I have a question about the most sensible and comprehensive way to summarise a corpus: In the documentation of a large multilingual translational corpus (comprising both a parallel and a comparable section), what kind of data about the corpus should one provide in order to characterise it comprehensively for the scientific community? The obvious information characterising a corpus is, of course:

- languages and language pairs
- size of the entire corpus and each subcorpus, measured in tokens and types
- description of metadata
- disclosure of text sources and sampling method

But what else should one provide? Word frequency lists? Measures of lexical diversity? Plots of text lengths for each sub section of the corpus? Any other visualizations of the corpus or its subcorpora?
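To make the candidates above concrete, here is a minimal Python sketch of how such per-subcorpus figures (tokens, types, a simple lexical diversity measure, text lengths, a short frequency list) could be computed. It is only an illustration under loud assumptions: the regex tokeniser is naive and not suitable for every language, and the names `tokenize` and `describe_subcorpus` as well as the toy texts are hypothetical, not part of any existing tool or of the corpus in question.

```python
import re
from collections import Counter
from statistics import mean, median

# Assumption: a naive, lowercasing word tokeniser. Real multilingual corpora
# usually need language-specific tokenisation.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split a raw text into lowercased word tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text.lower())


def describe_subcorpus(texts, top_n=10):
    """Return basic descriptive statistics for a list of raw text strings."""
    token_lists = [tokenize(t) for t in texts]
    lengths = [len(tokens) for tokens in token_lists]
    freq = Counter(tok for tokens in token_lists for tok in tokens)
    n_tokens = sum(lengths)
    n_types = len(freq)
    return {
        "texts": len(texts),
        "tokens": n_tokens,
        "types": n_types,
        # Plain type-token ratio, used here as the simplest diversity measure.
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_text_length": mean(lengths) if lengths else 0,
        "median_text_length": median(lengths) if lengths else 0,
        "top_words": freq.most_common(top_n),
    }


if __name__ == "__main__":
    # Toy data standing in for two subcorpora.
    corpus = {
        "en": ["The cat sat on the mat.", "The dog barked."],
        "de": ["Die Katze sitzt auf der Matte."],
    }
    for name, texts in corpus.items():
        print(name, describe_subcorpus(texts, top_n=3))
```

One caveat on the design: the plain type-token ratio falls as a text collection grows, so it is only comparable across subcorpora of similar size; a length-normalised measure would be needed for subcorpora of very different sizes.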

Or to reformulate the question: Given that the aim of the documentation is to describe the corpus rather than to answer research questions, what key facts about a corpus do potential users expect to find in the documentation in order to decide whether the resource is of any value to them?

Thank you in advance for your input. I am looking forward to an interesting discussion.


