The 9th edition of "Challenges in the Management of Large Corpora" is going to be held on the 12th of July, as a virtual pre-conference workshop before the Corpus Linguistics 2021 conference.

9th Workshop on the Challenges in the Management of Large Corpora

/Special Topic: Design and Management of Research Software/

When and where/how

CMLC-9 will take place on 12 July 2021 (hours t.b.a.) as a virtual event: a pre-conference workshop at the Corpus Linguistics 2021 conference <https://www.cl2021.org/>, hosted online by the University of Limerick, Ireland.

Important dates

* Deadline for abstract submission: 13 May 2021 (midnight UTC)

* Notification of acceptance: 31 May 2021

* Deadline for the submission of camera-ready papers: 25 June 2021

* Online meeting: 12 July 2021 (hours t.b.a.)

Abstract submission

We invite anonymised extended abstracts for oral presentations on the topics listed below. Abstracts should be 1000-1500 words excluding references, ideally prepared with the ACL-IJCNLP 2021 templates <https://2021.aclweb.org/calls/papers/#paper-submission-and-templates>, or otherwise as PDF with font preferably 11 pt and line spacing 1.5. Submissions are accepted through the EasyChair submission system at <https://easychair.org/conferences/?conf=cmlc9>.

For final submissions, please use the ACL-IJCNLP 2021 templates <https://2021.aclweb.org/calls/papers/#paper-submission-and-templates>.

Workshop description

The upcoming CMLC meeting continues the successful series of “Challenges in the management of large corpora” events, previously hosted at LREC (since 2012) and CL (since 2015) conferences. As in the previous meetings, we wish to explore common areas of interest across a range of issues in linguistic research data and tool management, corpus linguistics, natural language processing, and data science, this time with a special focus on tools.

Linguistic research software and other topics of interest

To an even greater extent than in other disciplines, linguistic research data can hardly be used without the help of appropriate research software. As frequently noted at CMLC events, this often relates to the need for client/server approaches, as language data cannot usually be downloaded and processed on the home or lab PC, for legal and logistical reasons. Additionally, due to the complexity and high dimensionality of linguistic data and the unknown nature of the variation factors, specialised tools are needed on the way from raw data to their interpretation. These tools cannot be considered part of a general technical infrastructure.

From the reconstruction or transformation of the raw data (e.g. its tokenization) onwards, the linguistic assumptions and decisions embodied in research tools, as well as their errors, influence observations and possibly research results as much as the research data itself does, if data and tools can be treated separately at all. While approaches to the management of research data have been discussed quite broadly over the last 15 years, research tools have received at best marginal attention.

For this reason, CMLC-9 will focus on approaches to the design, development and management of research software (while not ignoring the other CMLC topics):

* Software development for linguistic research

o Design

+ scientific criteria in software development

+ standards and good practices

o Quality management and control

+ testing

+ code review

o Licensing

+ source code licenses

+ contributor license agreements

+ embedded content licensing

o Lifecycle management

+ maintenance

+ reproducibility

+ availability on operating systems/platforms

o Contact with research community

+ teaching

+ software documentation and accessibility

+ potential “power users” and bug reporters

* Linguistic content challenges

o Dealing with the variety of language: multilinguality, historical texts, noisy OCR texts, user-generated content, etc.

o Integration of human computation (crowdsourcing) and automatic


o Quality management of annotations

o Dealing with different linguistic data types (corpora, facsimiles, experimental data, neuroimaging data, …)

* Technical challenges

o Storage and retrieval solutions for big textual data corpora: primary data (potentially including facsimiles, etc.), metadata, and annotation data

o Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks for language processing

o Dealing with streaming data (e.g. social media) and rapidly changing corpora

o Environmental impact of big language data computing

* Exploitation challenges

o Legal and privacy issues

+ new opportunities and issues after national implementations of EU Directive 2019/790 on copyright and related rights in the Digital Single Market

o Query languages, data models, and standardization

o Licensing models of open and closed data, coping with intellectual property restrictions

o Innovative approaches for aggregation and visualisation of text


In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.


Online proceedings will be published before the meeting in a peer-reviewed, open-access volume. (See e.g. the proceedings volume from the 2019 meeting <https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/8998>).

Programme Committee

Names will be added as Programme Committee members confirm their participation.

* Laurence Anthony (Waseda University, Japan)

* Vladimír Benko (Slovak Academy of Sciences)

* Felix Bildhauer (IDS Mannheim)

* Nils Diewald (IDS Mannheim)

* Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)

* Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)

* Johannes Graën (University of Zurich, Switzerland)

* Andrew Hardie (Lancaster University, UK)

* Serge Heiden (ENS de Lyon/IHRIM, France)

* Miloš Jakubíček (Lexical Computing Ltd.)

* Natalia Kotsyba (Samsung Poland)

* Dawn Knight (Cardiff University)

* Michal Křen (Charles University, Prague)

* Sandra Kübler (Indiana University, USA)

* Veronika Laippala (Turku University)

* Jochen Leidner (Thomson Reuters, UK)

* Verena Lyding (EURAC Research, Italy)

* Paul Rayson (Lancaster University, UK)

* Laurent Romary (INRIA)

* Jan-Oliver Rüdiger (IDS Mannheim)

* Kevin Scannell (Saint-Louis University)

* Roland Schäfer (FU Berlin)

* Roman Schneider (IDS Mannheim, Germany)

* Serge Sharoff (University of Leeds)

* Irena Spasić (Cardiff University, UK)

* Ludovic Tanguy (University of Toulouse)

Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences

Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich

Simon Clematide


The CMLC series homepage is located at <http://corpora.ids-mannheim.de/cmlc.html>.


--
Dr. Harald Lüngen
Leibniz-Institut für Deutsche Sprache
Programmbereich Korpuslinguistik
R5, 6-13
D-68161 Mannheim
Tel. +49 621 1581-418
Fax +49 621 1581-200

