[Corpora-List] Corpus documentation: how to describe a corpus?

Valerie Mapelli mapelli at elda.org
Mon Jul 3 10:19:49 CEST 2017


Hello,

For corpus validation, I suggest that you have a look at the work carried out by ELRA: http://www.elra.info/en/services-around-lrs/validation/

To come back to Michael's question on how to describe a corpus, having a persistent identifier for corpora has also become an important issue. You may find some hints about the use of the ISLRN (International Standard Language Resource Number) in the following article:

Valérie Mapelli, Vladimir Popescu, Lin Liu and Khalid Choukri (2016). Language Resource Citation: the ISLRN Dissemination and Further Developments. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016, Portorož, Slovenia: http://www.lrec-conf.org/proceedings/lrec2016/summaries/1256.html

Or request an ISLRN directly here: http://www.islrn.org/

Best,

Valérie Mapelli

On 01/07/2017 at 08:08, Alexander Osherenko wrote:
> BTW, is there any special literature about corpus validation and how
> it is done?
> A.
>
> --
> Alexander Osherenko, Dr. rer. nat.
> Senior HCI architect
>
> Founder and R&D
> Socioware Development <http://www.socioware.de/osherenko_page.html>
>
> Humboldt Innovation <http://www.humboldt-innovation.de/>
> Humboldt-Universität zu Berlin <http://www.hu-berlin.de/%7Eosherena/>
>
> Profile: ResearchGate
> <https://www.researchgate.net/profile/Alexander_Osherenko>
> Social interaction, globalization and computer-aided analysis
> <https://www.researchgate.net/publication/281644865_Social_Interaction_Globalization_and_Computer-Aided_Analysis_A_Practical_Guide_to_Developing_Social_Simulation> at
> Springer
>
> 2017-06-30 19:15 GMT+01:00 Martin Potthast
> <martin.potthast at uni-weimar.de <mailto:martin.potthast at uni-weimar.de>>:
>
> I'd like to add:
> - Corpus *statistics*: Descriptive statistics about the corpus and
> potential sub-corpora of interest.
> - Corpus *validation*: Any experiments and analyses to verify how
> much the corpus resembles the real-world population from which it
> was sampled. Also, analyses regarding specific biases that may be
> expected.
> - Corpus *verticals*: Any subsets of interest of the corpus
> pertaining to certain variables and characteristics, allowing for
> experiments tailored to specific sub-groups of a population.
> - Corpus *software/reproducibility*: Any software that may help to
> reproduce and to recreate the annotation process resulting in a
> given corpus, to allow others to build their own versions.
>
> Martin
>
> On Fri, Jun 30, 2017 at 7:35 PM, Kevin B. Cohen
> <kevin.cohen at gmail.com <mailto:kevin.cohen at gmail.com>> wrote:
>
> Hi, Michael,
>
> Great question--thanks for bringing it up, and I wish that I
> knew the answer! I hope that you'll collate/summarize responses.
>
> One way to think about this would be broad categories like the
> following. Most papers on corpora talk about some of them,
> but not necessarily all:
>
> Collection *process*: where did the data come from? What were
> the inclusion/exclusion criteria? Were duplicates excluded?
> Did documents get truncated at some maximum length? On and on...
> Corpus *contents*: size of corpus /and/ of corpora /and/ of
> documents: size in tokens, size in words, size in... (size in
> types gets you beyond description to a theory of lemmas) On
> and on...
> Annotation *process/results*: number of annotators (if any),
> backgrounds of annotators (if any), agreement between
> annotators (if any), metadata... on and on...
> Distribution/*availability*: can one get the data? If so,
> how, and from where? At what cost, and with what
> reannotation/redistribution restrictions? On and on...
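
For the "agreement between annotators" item above, one figure that is often reported is Cohen's kappa for two annotators. Below is a minimal sketch, assuming both annotators labelled exactly the same items; the function name cohen_kappa and the toy labels are hypothetical and not taken from any corpus discussed in this thread.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration only.
ann_a = ["N", "V", "N", "N", "V", "N"]
ann_b = ["N", "V", "V", "N", "V", "N"]
kappa = cohen_kappa(ann_a, ann_b)
```

With more than two annotators a different coefficient would be needed; this sketch only covers the two-annotator case.
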
>
> Looking forward to other answers!
>
> Kevin
>
>
> On Fri, Jun 30, 2017 at 6:17 AM, Ustaszewski, Michael
> <Michael.Ustaszewski at uibk.ac.at
> <mailto:Michael.Ustaszewski at uibk.ac.at>> wrote:
>
> Dear colleagues,
>
>
> I have a question about the most sensible and
> comprehensive way to summarise a corpus: In the
> documentation of a large multilingual translational corpus
> (comprising both a parallel and a comparable section), what
> kind of data about the corpus should one provide in order
> to comprehensively characterise the corpus for the
> scientific community? The obvious information
> characterising a corpus is, of course:
>
>
> - languages and language pairs
> - size of the entire corpus and each subcorpus, measured
> in tokens and types
> - description of metadata
> - disclosure of text sources and sampling method
>
>
> But what else should one provide? Word frequency lists?
> Measures of lexical diversity? Plots of text lengths for
> each subsection of the corpus? Any other visualizations
> of the corpus or its subcorpora?
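
The size figures and the candidates raised above (tokens, types, word frequency lists, lexical diversity) can be sketched per subcorpus as follows. This is only an illustration under stated assumptions: the tokenizer is a naive regular expression, the type-token ratio stands in for "lexical diversity", and the names tokenize, describe and subcorpora as well as the toy sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive assumption: a token is a lower-cased run of letters, digits or underscores.
    return re.findall(r"\w+", text.lower())


def describe(texts):
    """Return token count, type count, type-token ratio and top words."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    n_tokens = sum(counts.values())
    n_types = len(counts)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens if n_tokens else 0.0,
        "top": counts.most_common(3),
    }


# Toy sub-corpora, invented for illustration only.
subcorpora = {
    "en": ["The cat sat on the mat.", "The dog sat."],
    "de": ["Die Katze sitzt auf der Matte."],
}
report = {name: describe(texts) for name, texts in subcorpora.items()}
```

Note that the type-token ratio depends strongly on text length, so comparing subcorpora of different sizes would call for a length-normalised measure instead.
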
>
>
> Or to reformulate the question: Given that the aim of the
> documentation is to describe the corpus rather than to
> answer research questions, what key facts about a corpus
> do potential users expect when reading the documentation
> in order to decide whether the resource is of any value
> for them?
>
>
> Thank you in advance for your input; I am looking forward
> to an interesting discussion.
>
>
> Best,
>
> Michael Ustaszewski
>
>
> _______________________________________________
> UNSUBSCRIBE from this page:
> http://mailman.uib.no/options/corpora
> <http://mailman.uib.no/options/corpora>
> Corpora mailing list
> Corpora at uib.no <mailto:Corpora at uib.no>
> http://mailman.uib.no/listinfo/corpora
> <http://mailman.uib.no/listinfo/corpora>
>
>
>
>
> --
> Kevin Bretonnel Cohen, PhD
> Director, Biomedical Text Mining Group
> Computational Bioscience Program, U. Colorado School of Medicine
> D'Alembert Chair in Natural Language Processing for the
> Biomedical Domain
> LIMSI, CNRS, Université Paris-Saclay
> 303-916-2417 <tel:%28303%29%20916-2417>
> http://compbio.ucdenver.edu/Hunter_lab/Cohen
> <http://compbio.ucdenver.edu/Hunter_lab/Cohen>
>
>
>
>
>
> --
> Dr. Martin Potthast
> Bauhaus-Universität Weimar
> Digital Bauhaus Lab
> Bauhausstr. 9a
> 99423 Weimar
> Germany
>
> +49 3643 58 3567 <tel:+49%203643%20583567>
> +49 171 809 1945 <tel:+49%20171%208091945>
>
> www.potthast.net <http://www.potthast.net>
>

--
Valérie Mapelli
Sales Manager
tel +33 1 43 13 33 33 (secr.) / 33 32 (direct)
fax +33 1 43 13 33 30 / skype ID: mapelli_elda

Obtain your ISLRN: www.islrn.org

www.elda.org / www.elra.info
ELRA Catalogue of Language Resources: http://catalog.elra.info
Universal Catalogue of Language Resources: http://universal.elra.info
LREC Conference: www.lrec-conf.org
