[Corpora-List] Corpus documentation: how to describe a corpus?

Alexander Osherenko osherenko at gmx.de
Sat Jul 1 08:08:51 CEST 2017


BTW, is there any special literature about corpus validation and how it is done? A.

-- Alexander Osherenko, Dr. rer. nat. Senior HCI architect

Founder and R&D Socioware Development <http://www.socioware.de/osherenko_page.html>

Humboldt Innovation <http://www.humboldt-innovation.de/> Humboldt-Universität zu Berlin <http://www.hu-berlin.de/~osherena/>

Profile: ResearchGate <https://www.researchgate.net/profile/Alexander_Osherenko> Social interaction, globalization and computer-aided analysis <https://www.researchgate.net/publication/281644865_Social_Interaction_Globalization_and_Computer-Aided_Analysis_A_Practical_Guide_to_Developing_Social_Simulation> at Springer

2017-06-30 19:15 GMT+01:00 Martin Potthast <martin.potthast at uni-weimar.de>:


> I'd like to add:
> - Corpus *statistics*: Descriptive statistics about the corpus and
> potential sub-corpora of interest.
> - Corpus *validation*: Any experiments and analyses to verify how much
> the corpus resembles the real-world population from which it was sampled.
> Also, analyses regarding specific biases that may be expected.
> - Corpus *verticals*: Any subsets of interest of the corpus pertaining to
> certain variables and characteristics, allowing for experiments tailored to
> specific sub-groups of a population.
> - Corpus *software/reproducibility*: Any software that may help to
> reproduce and to recreate the annotation process resulting in a given
> corpus, to allow others to build their own versions.
>
> Martin
>
> On Fri, Jun 30, 2017 at 7:35 PM, Kevin B. Cohen <kevin.cohen at gmail.com>
> wrote:
>
>> Hi, Michael,
>>
>> Great question--thanks for bringing it up, and I wish that I knew the
>> answer! I hope that you'll collate/summarize responses.
>>
>> One way to think about this would be broad categories like the
>> following. Most papers on corpora talk about some of them, but not
>> necessarily all:
>>
>> Collection *process*: where did the data come from? What were the
>> inclusion/exclusion criteria? Were duplicates excluded? Did documents get
>> truncated at some maximum length? On and on...
>> Corpus *contents*: size of corpus *and* of corpora *and* of documents:
>> size in tokens, size in words, size in... (size in types gets you beyond
>> description to a theory of lemmas) On and on...
>> Annotation *process/results*: number of annotators (if any), backgrounds
>> of annotators (if any), agreement between annotators (if any), metadata...
>> on and on...
>> Distribution/*availability*: can one get the data? If so, how, and from
>> where? At what cost, and with what reannotation/redistribution
>> restrictions? On and on...
>>
>> Looking forward to other answers!
>>
>> Kevin
>>
>>
>> On Fri, Jun 30, 2017 at 6:17 AM, Ustaszewski, Michael <
>> Michael.Ustaszewski at uibk.ac.at> wrote:
>>
>>> Dear colleagues,
>>>
>>>
>>> I have a question about the most sensible and comprehensive way to
>>> summarise a corpus: In the documentation of a large multilingual
>>> translational corpus (comprising both a parallel and comparable section),
>>> what kind of data about the corpus should one provide in order to
>>> comprehensively characterise the corpus for the scientific community? The
>>> obvious information characterising a corpus is, of course:
>>>
>>>
>>> - languages and language pairs
>>> - size of the entire corpus and each subcorpus, measured in tokens and
>>> types
>>> - description of metadata
>>> - disclosure of text sources and sampling method
>>>
>>>
>>> But what else should one provide? Word frequency lists? Measures of
>>> lexical diversity? Plots of text lengths for each subsection of the
>>> corpus? Any other visualizations of the corpus or its subcorpora?
>>>
>>>
>>> Or to reformulate the question: Given that the aim of the documentation
>>> is to describe the corpus rather than to answer research questions, what
>>> key facts about a corpus do potential users expect when reading the
>>> documentation in order to decide whether the resource is of any value for
>>> them?
>>>
>>>
>>> Thank you in advance for your input; I am looking forward to an
>>> interesting discussion.
>>>
>>>
>>> Best,
>>>
>>> Michael Ustaszewski
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>>
>>
>>
>> --
>> Kevin Bretonnel Cohen, PhD
>> Director, Biomedical Text Mining Group
>> Computational Bioscience Program, U. Colorado School of Medicine
>> D'Alembert Chair in Natural Language Processing for the Biomedical Domain
>> LIMSI, CNRS, Université Paris-Saclay
>> 303-916-2417
>> http://compbio.ucdenver.edu/Hunter_lab/Cohen
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Dr. Martin Potthast
> Bauhaus-Universität Weimar
> Digital Bauhaus Lab
> Bauhausstr. 9a
> 99423 Weimar
> Germany
>
> +49 3643 58 3567
> +49 171 809 1945
>
> www.potthast.net
>
>
>
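The descriptive statistics mentioned in the thread (size in tokens and types, a word frequency list, a simple measure of lexical diversity, document lengths) could be sketched roughly as follows. This is only a minimal illustration, not anything proposed by the posters: the function name describe_corpus, the regex-based tokenizer, and the choice of the type-token ratio as the diversity measure are all assumptions made for the example.

```python
import re
from collections import Counter


def describe_corpus(texts):
    """Return basic descriptive statistics for a list of document strings.

    Hypothetical helper for illustration only. Assumption: a token is a
    lowercased run of word characters; a real corpus would use its own
    tokenizer.
    """
    docs = [re.findall(r"\w+", text.lower()) for text in texts]
    freq = Counter(token for doc in docs for token in doc)
    n_tokens = sum(freq.values())
    n_types = len(freq)
    return {
        "documents": len(docs),
        "tokens": n_tokens,
        "types": n_types,
        # Type-token ratio: one simple (size-sensitive) diversity measure.
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_doc_length": n_tokens / len(docs) if docs else 0.0,
        "doc_lengths": [len(doc) for doc in docs],
        "top_words": freq.most_common(5),
    }


if __name__ == "__main__":
    stats = describe_corpus(["The cat sat.", "The dog sat down."])
    for key, value in stats.items():
        print(key, value)
```

The same function could be called once per subcorpus (for example, per language or per language pair) to get the per-section figures. Note that the type-token ratio falls as a corpus grows, so it is only comparable between sections of similar size.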