The 8th edition of "Challenges in the Management of Large Corpora" (CMLC-8) will be held on 16 May 2020 (morning session) in Marseille, during the LREC-2020 conference.

Please note the new deadline for abstract submission: 23 February 2020, 23:59 CET.

Workshop description

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A number of key themes and questions of interest to the contributing research communities emerge: (a) what can be done to deal with IPR and data protection issues? (b) what sampling techniques can we apply? (c) what quality issues should we be aware of? (d) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (f) what kinds of APIs or other means of access would make the corpus data as widely usable as possible without interfering with legal restrictions? (g) how can we guarantee that corpus data remain available and usable in a sustainable way?

The CMLC workshop series invites papers dealing with challenges that arise in particular in connection with very large corpora, on topics such as: sampling approaches; web harvesting approaches; quality assessment; efficient solutions for storage, processing, querying and analysis; dimension reduction and exploratory data visualisation; interfaces/APIs and other approaches to make the corpus data as widely usable as possible; sustainability and interoperability in general; intellectual property rights and licensing. This year’s event will cover the whole range of the standard CMLC themes, with some new additions, and will adopt some of LREC 2020’s focus topics.

In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

See http://corpora.ids-mannheim.de/cmlc-2020.html for more details and updates.

Important dates

* Deadline for abstract submission: 23 February 2020, 23:59 CET, via the START manager exclusively

* Notification of acceptance: 12 March 2020

* Deadline for the submission of camera-ready papers: 26 March 2020

* Meeting: 16 May 2020, morning session

Submission categories

We invite anonymised extended abstracts for oral presentations on the topics listed above (PDF, 1000-1500 words excluding references, font preferably 11 pt, line spacing 1.5).

CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised. The number of poster slots is limited. If there is spare capacity in the poster session, we reserve the right to change the presentation format of accepted papers from oral presentation to poster. Such a change will not affect how the paper is presented in the proceedings.

Submissions will be accepted exclusively through the START system (https://www.softconf.com/lrec2020/CMLC-8/).

Programme Committee

* Laurence Anthony (Waseda University, Japan)
* Vladimír Benko (Slovak Academy of Sciences)
* Felix Bildhauer (IDS Mannheim)
* Sonja Bosch (University of South Africa)
* Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
* Damir Ćavar (Indiana University)
* Tomaž Erjavec (Jožef Stefan Institute)
* Johannes Graën (University of Gothenburg, Pompeu Fabra University)
* Andrew Hardie (Lancaster University)
* Serge Heiden (ENS de Lyon)
* Miloš Jakubíček (Lexical Computing Ltd.)
* Dawn Knight (Cardiff University)
* Natalia Kotsyba (Samsung Poland)
* Michal Křen (Charles University, Prague)
* Sandra Kübler (Indiana University, Bloomington)
* Gaël Lejeune (Sorbonne Université)
* Paul Rayson (Lancaster University)
* Martin Reynaert (Tilburg University)
* Laurent Romary (INRIA)
* Kevin Scannell (Saint-Louis University)
* Roland Schäfer (FU Berlin)
* Serge Sharoff (University of Leeds)
* Irena Spasic (Cardiff University)
* Marko Tadić (University of Zagreb, Faculty of Humanities and Social Sciences)
* Ludovic Tanguy (University of Toulouse)
* Dan Tufiş (Romanian Academy, Bucharest)

Organising Committee

    Institut für Deutsche Sprache, Mannheim

    Piotr Bański, Marc Kupietz, Harald Lüngen

    Berlin-Brandenburg Academy of Sciences

    Adrien Barbaresi

    Institute of Computational Linguistics, University of Zurich

    Simon Clematide


The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html

