[Corpora-List] Corpus documentation: how to describe a corpus?

Georg Rehm georg.rehm at gmail.com
Fri Jun 30 20:36:54 CEST 2017


Dear Michael,

all metadata schemas created for the description of corpora or, more generally, language resources, contain a multitude of different fields, aspects, dimensions of how best to “summarise” or to describe a corpus.

An important aspect that also deserves attention is the life cycle of a language resource; some first ideas are in this paper:

Georg Rehm. The Language Resource Life Cycle: Towards a Generic Model for Creating, Maintaining, Using and Distributing Language Resources. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2016), pages 2450-2454, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).

Best, Georg


> On 30 Jun 2017, at 20:15, Martin Potthast <martin.potthast at uni-weimar.de> wrote:
>
> I'd like to add:
> - Corpus statistics: Descriptive statistics about the corpus and potential sub-corpora of interest.
> - Corpus validation: Any experiments and analyses to verify how closely the corpus resembles the real-world population from which it was sampled. Also, analyses regarding specific biases that may be expected.
> - Corpus verticals: Any subsets of interest of the corpus pertaining to certain variables and characteristics, allowing for experiments tailored to specific sub-groups of a population.
> - Corpus software/reproducibility: Any software that may help to reproduce and to recreate the annotation process resulting in a given corpus, to allow others to build their own versions.
>
> Martin
>
> On Fri, Jun 30, 2017 at 7:35 PM, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:
> Hi, Michael,
>
> Great question--thanks for bringing it up, and I wish that I knew the answer! I hope that you'll collate/summarize responses.
>
> One way to think about this would be broad categories like the following. Most papers on corpora talk about some of them, but not necessarily all:
>
> Collection process: where did the data come from? What were the inclusion/exclusion criteria? Were duplicates excluded? Did documents get truncated at some maximum length? On and on...
> Corpus contents: size of corpus and of corpora and of documents: size in tokens, size in words, size in... (size in types gets you beyond description to a theory of lemmas) On and on...
> Annotation process/results: number of annotators (if any), backgrounds of annotators (if any), agreement between annotators (if any), metadata... on and on...
> Distribution/availability: can one get the data? If so, how, and from where? At what cost, and with what reannotation/redistribution restrictions? On and on...
>
> Looking forward to other answers!
>
> Kevin
>
>
> On Fri, Jun 30, 2017 at 6:17 AM, Ustaszewski, Michael <Michael.Ustaszewski at uibk.ac.at> wrote:
> Dear colleagues,
>
>
> I have a question about the most sensible and comprehensive way to summarise a corpus: In the documentation of a large multilingual translational corpus (comprising both a parallel and comparable section), what kind of data about the corpus should one provide in order to comprehensively characterise the corpus for the scientific community? The obvious information characterising a corpus is, of course:
>
>
> - languages and language pairs
> - size of the entire corpus and each subcorpus, measured in tokens and types
> - description of metadata
> - disclosure of text sources and sampling method
>
>
>
> But what else should one provide? Word frequency lists? Measures of lexical diversity? Plots of text lengths for each sub section of the corpus? Any other visualizations of the corpus or its subcorpora?
>
>
> Or to reformulate the question: Given that the aim of the documentation is to describe the corpus rather than to answer research questions, what key facts about a corpus do potential users expect to find in the documentation in order to decide whether the resource is of any value to them?
>
>
> Thank you in advance for your input. I am looking forward to an interesting discussion.
>
>
> Best,
>
> Michael Ustaszewski
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora <http://mailman.uib.no/options/corpora>
> Corpora mailing list
> Corpora at uib.no <mailto:Corpora at uib.no>
> http://mailman.uib.no/listinfo/corpora <http://mailman.uib.no/listinfo/corpora>
>
>
>
>
> --
> Kevin Bretonnel Cohen, PhD
> Director, Biomedical Text Mining Group
> Computational Bioscience Program, U. Colorado School of Medicine
> D'Alembert Chair in Natural Language Processing for the Biomedical Domain
> LIMSI, CNRS, Université Paris-Saclay
> 303-916-2417
> http://compbio.ucdenver.edu/Hunter_lab/Cohen
>
>
>
>
>
> --
> Dr. Martin Potthast
> Bauhaus-Universität Weimar
> Digital Bauhaus Lab
> Bauhausstr. 9a
> 99423 Weimar
> Germany
>
> +49 3643 58 3567
> +49 171 809 1945
>
> www.potthast.net <http://www.potthast.net/>
