Hi, Michael,

Great question--thanks for bringing it up, and I wish that I knew the answer! I hope that you'll collate/summarize responses.

One way to think about this would be broad categories like the following. Most papers on corpora talk about some of them, but not necessarily all:

- Collection *process*: where did the data come from? What were the inclusion/exclusion criteria? Were duplicates excluded? Did documents get truncated at some maximum length? On and on...
- Corpus *contents*: size of corpus *and* of corpora *and* of documents: size in tokens, size in words, size in... (size in types gets you beyond description to a theory of lemmas) On and on...
- Annotation *process/results*: number of annotators (if any), backgrounds of annotators (if any), agreement between annotators (if any), metadata... On and on...
- Distribution/*availability*: can one get the data? If so, how, and from where? At what cost, and with what reannotation/redistribution restrictions? On and on...
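
To make the *contents* category a bit more concrete, here is a minimal sketch of how some of those size figures might be computed. It is only an illustration under stated assumptions: the function name `describe_corpus` is made up, the corpus is assumed to be a plain list of text strings, and the tokenizer is a deliberately naive regular expression (lowercased runs of word characters), so the counts will differ from those of any real tokenizer or lemmatizer.

```python
import re
from collections import Counter
from statistics import mean, median


def describe_corpus(documents):
    """Return basic size statistics for a list of raw text strings.

    Hypothetical helper for illustration; not taken from any corpus toolkit.
    """
    # Naive tokenization (an assumption): lowercase runs of word characters.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]
    lengths = [len(tokens) for tokens in tokenized]
    counts = Counter(token for tokens in tokenized for token in tokens)
    n_tokens = sum(lengths)
    return {
        "documents": len(documents),
        "tokens": n_tokens,
        # "Types" here means distinct word forms, not lemmas.
        "types": len(counts),
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
        "mean_doc_length": mean(lengths) if lengths else 0.0,
        "median_doc_length": median(lengths) if lengths else 0.0,
        "most_frequent": counts.most_common(3),
    }


stats = describe_corpus(["The cat sat.", "The dog sat on the mat."])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Running the same function once per subcorpus (per language, per section) would give the per-subcorpus breakdown. Note that the type-token ratio depends heavily on corpus size, so it is not directly comparable across subcorpora of different sizes.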

Looking forward to other answers!


> Dear colleagues,
> I have a question about the most sensible and comprehensive way to
> summarise a corpus: In the documentation of a large multilingual
> translational corpus (comprising both a parallel and comparable section),
> what kind of data about the corpus should one provide in order to
> comprehensively characterise the corpus for the scientific community? The
> obvious information characterising a corpus is, of course:
> - languages and language pairs
> - size of the entire corpus and each subcorpus, measured in tokens and
> types
> - description of metadata
> - disclosure of text sources and sampling method
> But what else should one provide? Word frequency lists? Measures of
> lexical diversity? Plots of text lengths for each subsection of the
> corpus? Any other visualizations of the corpus or its subcorpora?
> Or to reformulate the question: Given that the aim of the documentation is
> to describe the corpus rather than to answer research questions, what key
> facts about a corpus do potential users expect when reading the
> documentation in order to decide whether the resource is of any value for
> them?
> Thank you in advance for your input. I am looking forward to an
> interesting discussion.
> Best,
> Michael Ustaszewski
