[Corpora-List] Corpus documentation: how to describe a corpus?

Ayman Eddakrouri a.eldakroury at aucegypt.edu
Sat Jul 1 00:58:55 CEST 2017

I think one of the most important things when describing or documenting any publication or corpus is to put its name (if any) and its creator first, and its location (link/URL) last. In between, we can add everything you have all mentioned in this excellent discussion thread.

It would also be very helpful if some entity or organization adopted all these description fields and defined a rule or standard for ordering them, whether by priority or by some other criterion.

Ayman Eddakrouri PhD in Arabic Corpora

On Friday, June 30, 2017, Georg Rehm <georg.rehm at gmail.com> wrote:

> Dear Michael,
> all metadata schemas created for the description of corpora or, more
> generally, language resources, contain a multitude of different fields,
> aspects, dimensions of how best to “summarise” or to describe a corpus.
> An important aspect that also deserves attention is the one of the life
> cycle – some first ideas are in this paper:
> Georg Rehm. The Language Resource Life Cycle: Towards a Generic Model for
> Creating, Maintaining, Using and Distributing Language Resources. In
> Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck,
> Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan
> Odijk, and Stelios Piperidis, editors, Proceedings of the 10th Language
> Resources and Evaluation Conference (LREC 2016), pages 2450-2454, Portorož,
> Slovenia, May 2016. European Language Resources Association (ELRA).
> On 30 Jun 2017, at 20:15, Martin Potthast <martin.potthast at uni-weimar.de> wrote:
> I'd like to add:
> - Corpus *statistics*: Descriptive statistics about the corpus and
> potential sub-corpora of interest.
> - Corpus *validation*: Any experiments and analyses to verify how much
> the corpus resembles the real-world population from which it was sampled.
> Also, analyses regarding specific biases that may be expected.
> - Corpus *verticals*: Any subsets of interest of the corpus pertaining to
> certain variables and characteristics, allowing for experiments tailored to
> specific sub-groups of a population.
> - Corpus *software/reproducibility*: Any software that may help to
> reproduce and to recreate the annotation process resulting in a given
> corpus, to allow others to build their own versions.
> On Fri, Jun 30, 2017 at 7:35 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:
>> One way to think about this would be broad categories like the
>> following. Most papers on corpora talk about some of them, but not
>> necessarily all:
>> Collection *process*: where did the data come from? What were the
>> inclusion/exclusion criteria? Were duplicates excluded? Did documents get
>> truncated at some maximum length? On and on...
>> Corpus *contents*: size of corpus *and* of corpora *and* of documents:
>> size in tokens, size in words, size in... (size in types gets you beyond
>> description to a theory of lemmas) On and on...
>> Annotation *process/results*: number of annotators (if any), backgrounds
>> of annotators (if any), agreement between annotators (if any), metadata...
>> on and on...
>> Distribution/*availability*: can one get the data? If so, how, and from
>> where? At what cost, and with what reannotation/redistribution
>> restrictions? On and on...
>> On Fri, Jun 30, 2017 at 6:17 AM, Ustaszewski, Michael <Michael.Ustaszewski at uibk.ac.at> wrote:
>>> Dear colleagues,
>>> I have a question about the most sensible and comprehensive way to
>>> summarise a corpus: In the documentation of a large multilingual
>>> translational corpus (comprising both a parallel and comparable section),
>>> what kind of data about the corpus should one provide in order to
>>> comprehensively characterise the corpus for the scientific community? The
>>> obvious information characterising a corpus is, of course:
>>> - languages and language pairs
>>> - size of the entire corpus and each subcorpus, measured in tokens and
>>> types
>>> - description of metadata
>>> - disclosure of text sources and sampling method
>>> But what else should one provide? Word frequency lists? Measures of
>>> lexical diversity? Plots of text lengths for each sub section of the
>>> corpus? Any other visualizations of the corpus or its subcorpora?
>>> Or to reformulate the question: Given that the aim of the documentation
>>> is to describe the corpus rather than to answer research questions, what
>>> key facts about a corpus do potential users expect when reading the
>>> documentation in order to decide whether the resource is of any value for
>>> them?
>>> Thank you in advance for your input; I am looking forward to an
>>> interesting discussion.
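
Several of the items raised in this thread (size in tokens and types for the whole corpus and for each subcorpus, text lengths, word frequency lists, a simple measure of lexical diversity) can be computed with very little code. What follows is only a minimal sketch in Python, not a method proposed by anyone in the thread. The function names, the toy two-language corpus, and the regex-based tokenizer are all assumptions made for illustration; a real corpus would need a language-appropriate tokenizer, and the type-token ratio is sensitive to text length, so it is not directly comparable across subcorpora of different sizes.

```python
import re
from collections import Counter
from statistics import mean, median

# Assumption: a token is a run of word characters. This is a deliberate
# simplification and is not adequate for every language or script.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word-character runs."""
    return TOKEN_RE.findall(text.lower())


def describe(documents):
    """Return basic descriptive statistics for a list of document strings."""
    doc_tokens = [tokenize(doc) for doc in documents]
    lengths = [len(tokens) for tokens in doc_tokens]
    freq = Counter(tok for tokens in doc_tokens for tok in tokens)
    n_tokens = sum(lengths)
    return {
        "documents": len(documents),
        "tokens": n_tokens,
        "types": len(freq),
        # Crude lexical-diversity measure: distinct forms / running words.
        "type_token_ratio": len(freq) / n_tokens if n_tokens else 0.0,
        "mean_doc_length": mean(lengths) if lengths else 0.0,
        "median_doc_length": median(lengths) if lengths else 0.0,
        # Head of the word frequency list.
        "top_words": freq.most_common(3),
    }


# Hypothetical toy corpus: subcorpus name -> list of documents.
corpus = {
    "en": ["The cat sat on the mat.", "The dog barked."],
    "de": ["Die Katze sitzt auf der Matte."],
}

# One report per subcorpus, plus one for the corpus as a whole.
report = {name: describe(docs) for name, docs in corpus.items()}
report["ALL"] = describe([doc for docs in corpus.values() for doc in docs])

for name, stats in report.items():
    print(name, stats)
```

The per-subcorpus dictionaries could be written out as a table in the corpus documentation; the per-document lengths collected inside describe() are also what one would plot to show text-length distributions.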
