[Corpora-List] Corpus documentation: how to describe a corpus?

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Sat Jul 1 10:33:07 CEST 2017

Hi Michael, Kevin,

I think the situation is a bit different for the biggest corpora, which are built from the web and usually lack any reliable metadata (beyond a URL and crawl date). What's in there usually takes a while to discover, and one way to get a quick assessment is to compute similarity to other corpora (especially those of which one has a better picture).

My two obvious references here would be:

A. Kilgarriff. Comparing corpora. International Journal of Corpus Linguistics, 2001.

A. Kilgarriff. Getting to know your corpus. Proc. Text, Speech, Dialogue, 2012.

The missing bit (though still on my agenda) for corpus similarity is corpus homogeneity/heterogeneity, another valuable property to know, and one that actually needs to be established before similarity. If I say that A is more similar to X than B is to X, I need to know the homogeneity of A, B and X. Some corpora (e.g. the BNC) are so heterogeneous that computing similarity to the whole BNC does not provide much information, so one then needs to compare to just a subcorpus, for example.
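As a rough illustration of the point above, here is a minimal sketch of a frequency-based similarity score and of homogeneity measured as the similarity between random halves of the same corpus. It assumes whitespace tokenization, corpora given as lists of document strings, and a chi-square statistic over the most frequent words; the function names, the `top_n` cutoff and the number of rounds are illustrative choices, not necessarily what the papers above or any particular tool use. Note that raw chi-square values grow with corpus size, so scores are only comparable between samples of similar size.

```python
import random
from collections import Counter


def word_freqs(docs):
    """Count lowercased, whitespace-separated tokens over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def chi_square_distance(freq_a, freq_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora
    combined. Lower values mean more similar word frequency profiles."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both corpora must contain at least one token")
    joint = freq_a + freq_b
    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both corpora were samples of the same population.
        expected_a = joint_count * total_a / (total_a + total_b)
        expected_b = joint_count * total_b / (total_a + total_b)
        score += (freq_a[word] - expected_a) ** 2 / expected_a
        score += (freq_b[word] - expected_b) ** 2 / expected_b
    return score


def homogeneity(docs, rounds=10, top_n=500, seed=0):
    """Average chi-square distance between random halves of one corpus.
    Lower values suggest a more homogeneous corpus."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first = word_freqs(shuffled[:half])
        second = word_freqs(shuffled[half:])
        scores.append(chi_square_distance(first, second, top_n))
    return sum(scores) / len(scores)
```

With something like this, one would compute homogeneity(A), homogeneity(B) and homogeneity(X) first, and only then read chi_square_distance(word_freqs(A), word_freqs(X)) against those within-corpus values.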

Best,
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On 30 June 2017 at 19:35, Kevin B. Cohen <kevin.cohen at gmail.com> wrote:

> Hi, Michael,
> Great question--thanks for bringing it up, and I wish that I knew the
> answer! I hope that you'll collate/summarize responses.
> One way to think about this would be broad categories like the following.
> Most papers on corpora talk about some of them, but not necessarily all:
> Collection *process*: where did the data come from? What were the
> inclusion/exclusion criteria? Were duplicates excluded? Did documents get
> truncated at some maximum length? On and on...
> Corpus *contents*: size of corpus *and* of corpora *and* of documents:
> size in tokens, size in words, size in... (size in types gets you beyond
> description to a theory of lemmas) On and on...
> Annotation *process/results*: number of annotators (if any), backgrounds
> of annotators (if any), agreement between annotators (if any), metadata...
> on and on...
> Distribution/*availability*: can one get the data? If so, how, and from
> where? At what cost, and with what reannotation/redistribution
> restrictions? On and on...
> Looking forward to other answers!
> Kevin
> On Fri, Jun 30, 2017 at 6:17 AM, Ustaszewski, Michael <
> Michael.Ustaszewski at uibk.ac.at> wrote:
>> Dear colleagues,
>> I have a question about the most sensible and comprehensive way to
>> summarise a corpus: In the documentation of a large multilingual
>> translational corpus (comprising both a parallel and comparable section),
>> what kind of data about the corpus should one provide in order to
>> comprehensively characterise the corpus for the scientific community? The
>> obvious information characterising a corpus is, of course:
>> - languages and language pairs
>> - size of the entire corpus and each subcorpus, measured in tokens and
>> types
>> - description of metadata
>> - disclosure of text sources and sampling method
>> But what else should one provide? Word frequency lists? Measures of
>> lexical diversity? Plots of text lengths for each sub section of the
>> corpus? Any other visualizations of the corpus or its subcorpora?
>> Or to reformulate the question: Given that the aim of the documentation
>> is to describe the corpus rather than to answer research questions, what
>> key facts about a corpus do potential users expect when reading the
>> documentation in order to decide whether the resource is of any value for
>> them?
>> Thank you in advance for your inputs, I am looking forward to an
>> interesting discussion.
>> Best,
>> Michael Ustaszewski
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> --
> Kevin Bretonnel Cohen, PhD
> Director, Biomedical Text Mining Group
> Computational Bioscience Program, U. Colorado School of Medicine
> D'Alembert Chair in Natural Language Processing for the Biomedical Domain
> LIMSI, CNRS, Université Paris-Saclay
> 303-916-2417
> http://compbio.ucdenver.edu/Hunter_lab/Cohen