[Corpora-List] Corpus documentation: how to describe a corpus?

Kevin B. Cohen kevin.cohen at gmail.com
Fri Jun 30 19:35:58 CEST 2017

Hi, Michael,

Great question--thanks for bringing it up, and I wish that I knew the answer! I hope that you'll collate/summarize responses.

One way to think about this would be broad categories like the following. Most papers on corpora talk about some of them, but not necessarily all:

Collection *process*: where did the data come from? What were the inclusion/exclusion criteria? Were duplicates excluded? Did documents get truncated at some maximum length? On and on...

Corpus *contents*: size of corpus *and* of corpora *and* of documents: size in tokens, size in words, size in... (size in types gets you beyond description to a theory of lemmas) On and on...

Annotation *process/results*: number of annotators (if any), backgrounds of annotators (if any), agreement between annotators (if any), metadata... On and on...

Distribution/*availability*: can one get the data? If so, how, and from where? At what cost, and with what reannotation/redistribution restrictions? On and on...
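To make the *contents* category a bit more concrete, here is a minimal sketch in Python of the kind of size figures one might report. It rests on assumptions that are mine, not a standard: tokens are whitespace-separated, types are lowercased token forms, and the function name `describe_corpus` is purely hypothetical. A real corpus would need a proper tokenizer and an explicit decision about what counts as a type.

```python
from collections import Counter
from statistics import mean, median


def describe_corpus(documents):
    """Return basic size statistics for a list of document strings.

    Assumption: tokens are whitespace-separated and types are
    lowercased token forms; swap in a real tokenizer as needed.
    """
    doc_lengths = []          # number of tokens in each document
    type_counts = Counter()   # frequency of each (lowercased) type

    for text in documents:
        tokens = text.split()
        doc_lengths.append(len(tokens))
        type_counts.update(token.lower() for token in tokens)

    n_tokens = sum(doc_lengths)
    n_types = len(type_counts)

    return {
        "documents": len(documents),
        "tokens": n_tokens,
        "types": n_types,
        # Type-token ratio depends heavily on corpus size, so it is
        # only comparable between samples of similar length.
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_doc_length": mean(doc_lengths) if doc_lengths else 0.0,
        "median_doc_length": median(doc_lengths) if doc_lengths else 0.0,
        "min_doc_length": min(doc_lengths, default=0),
        "max_doc_length": max(doc_lengths, default=0),
    }


if __name__ == "__main__":
    # Toy, pre-tokenized sample; not real corpus data.
    sample = ["The cat sat on the mat .", "A dog barked ."]
    for key, value in describe_corpus(sample).items():
        print(f"{key}: {value}")
```

Running the same function once per subcorpus and once over the whole collection would give the "size of corpus and of documents" figures side by side, with the tokenization choice stated alongside them.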

Looking forward to other answers!


On Fri, Jun 30, 2017 at 6:17 AM, Ustaszewski, Michael <Michael.Ustaszewski at uibk.ac.at> wrote:

> Dear colleagues,
> I have a question about the most sensible and comprehensive way to
> summarise a corpus: In the documentation of a large multilingual
> translational corpus (comprising both a parallel and comparable section),
> what kind of data about the corpus should one provide in order to
> comprehensively characterise the corpus for the scientific community? The
> obvious information characterising a corpus is, of course:
> - languages and language pairs
> - size of the entire corpus and each subcorpus, measured in tokens and
> types
> - description of metadata
> - disclosure of text sources and sampling method
> But what else should one provide? Word frequency lists? Measures of
> lexical diversity? Plots of text lengths for each subsection of the
> corpus? Any other visualizations of the corpus or its subcorpora?
> Or to reformulate the question: Given that the aim of the documentation is
> to describe the corpus rather than to answer research questions, what key
> facts about a corpus do potential users expect when reading the
> documentation in order to decide whether the resource is of any value for
> them?
> Thank you in advance for your inputs, I am looking forward to an
> interesting discussion.
> Best,
> Michael Ustaszewski
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

--
Kevin Bretonnel Cohen, PhD
Director, Biomedical Text Mining Group
Computational Bioscience Program, U. Colorado School of Medicine
D'Alembert Chair in Natural Language Processing for the Biomedical Domain
LIMSI, CNRS, Université Paris-Saclay
303-916-2417
http://compbio.ucdenver.edu/Hunter_lab/Cohen
