I have a question about the most sensible and comprehensive way to summarise a corpus. In the documentation of a large multilingual translational corpus (comprising both a parallel and a comparable section), what data about the corpus should one provide in order to characterise it comprehensively for the scientific community? The obvious information characterising a corpus is, of course:
- languages and language pairs
- size of the entire corpus and each subcorpus, measured in tokens and types
- description of metadata
- disclosure of text sources and sampling method
But what else should one provide? Word frequency lists? Measures of lexical diversity? Plots of text lengths for each subsection of the corpus? Any other visualizations of the corpus or its subcorpora?
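To make these candidate statistics concrete, here is a minimal Python sketch of how token counts, type counts, a type-token ratio and a short frequency list could be computed per subcorpus. The function name `summarise`, the regex-based tokenisation and the lower-casing are illustrative assumptions only; a real multilingual corpus would need language-specific tokenisation. Note also that the plain type-token ratio depends on text length, so it is not directly comparable across subcorpora of different sizes.

```python
import re
from collections import Counter


def summarise(text, top_n=3):
    """Return basic descriptive statistics for one (sub)corpus.

    Assumption: tokens are runs of word characters, lower-cased.
    This naive tokenisation is for illustration only.
    """
    tokens = re.findall(r"\w+", text.lower())
    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Type-token ratio: a simple, length-sensitive diversity measure.
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        # Most frequent word forms with their counts.
        "top": freq.most_common(top_n),
    }


if __name__ == "__main__":
    # Hypothetical subcorpora, keyed by an invented label.
    subcorpora = {
        "en_sample": "The cat sat on the mat. The end.",
        "de_sample": "Die Katze sitzt auf der Matte.",
    }
    for name, text in subcorpora.items():
        print(name, summarise(text))
```

The same function could be applied to the whole corpus and to each subcorpus, so that the documentation reports the figures at both levels.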
Or, to reformulate the question: given that the aim of the documentation is to describe the corpus rather than to answer research questions, what key facts about a corpus do potential users expect to find in the documentation in order to decide whether the resource is of any value to them?
Thank you in advance for your input. I am looking forward to an interesting discussion.
Michael Ustaszewski