[Corpora-List] Workshop on Human Evaluation of NLP Systems (HumEval): First Call for Papers

a belz a.s.belz at gmail.com
Wed Dec 2 14:14:30 CET 2020

Workshop on Human Evaluation of NLP Systems (HumEval)

EACL’21, Kiev, Ukraine, 19-20 April 2021


First Call for Papers

The HumEval Workshop invites the submission of long and short papers on substantial, original, and unpublished research on all aspects of human evaluation of NLP systems, both intrinsic and extrinsic, including but by no means limited to NLP systems whose output is language.

Invited Speakers

Mohit Bansal, UNC Chapel Hill, US

Margaret Mitchell, Google, US

Lucia Specia, UCL, UK

Important Dates

Dec 2: First Call for Workshop Papers

Dec 18: Second Call for Workshop Papers

Jan 18: Workshop Paper Due Date

Feb 18: Notification of Acceptance

Mar 01: Camera-ready papers due

Apr 19-20: Workshop Dates

All deadlines are 11.59 pm UTC-12.

Workshop Topic and Content

Human evaluation plays a central role in NLP, from the large-scale crowd-sourced evaluations carried out e.g. by the WMT workshops, to the much smaller experiments routinely encountered in conference papers. Moreover, while NLP embraced automatic evaluation metrics from BLEU (Papineni et al., 2001) onwards, the field has always been acutely aware of their limitations (Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017; Reiter, 2018), and has gauged their trustworthiness in terms of how well, and how consistently, they correlate with human evaluation scores (Over et al., 2007; Gatt and Belz, 2008; Bojar et al., 2016; Shimorina, 2018; Ma et al., 2019; Mille et al., 2019; Dušek et al., 2020).

Yet there is growing unease about how human evaluations are conducted in NLP. Researchers have pointed out the less than perfect experimental and reporting standards that prevail (van der Lee et al., 2019). Only a small proportion of papers provide enough detail for reproduction of human evaluations, and in many cases the information provided is not even enough to support the conclusions drawn. More than 200 different quality criteria (Fluency, Grammaticality, etc.) have been used in NLP (Howcroft et al., 2020). Different papers use the same quality criterion name with different definitions, and the same definition with different names. As a result, we currently have no way of determining whether two evaluations assess the same thing, which poses problems for both meta-evaluation and reproducibility assessments (Belz et al., 2020).

Reproducibility in the context of automatically computed system scores has recently attracted a lot of attention, against the background of a troubling history (Pedersen, 2008; Mieskes et al., 2019), where reproduction is perceived as failing in 24.9% of cases for one's own results, and in 56.7% of cases for another team's (Mieskes et al., 2019). Initiatives have included the Reproducibility Challenge (Pineau et al., 2019; Sinha et al., 2020); the Reproduction Paper special category at COLING'18; the reproducibility programme at NeurIPS'19 comprising code submission, a reproducibility challenge, and the ML Reproducibility Checklist, also adopted by EMNLP'20 and AAAI'21; and the REPROLANG shared task at LREC'20 (Branco et al., 2020).

However, reproducibility in the context of system scores obtained via human evaluations has barely been addressed at all, with only a tiny number of papers (e.g. Belz & Kow, 2010; Cooper & Shardlow, 2020) reporting attempted reproductions of results. The developments in reproducibility of automatically computed scores listed above are important, but it is concerning that not a single one of these initiatives and events addresses human evaluations. For example, even if a paper fully complies with all of the NeurIPS'19/EMNLP'20 reproducibility criteria, any human evaluation results reported in it may not be reproducible to any degree, simply because the criteria do not address human evaluation in any way.

With this workshop we wish to create a forum for current human evaluation research and future directions, a space for researchers working with human evaluations to exchange ideas and begin to address the issues that human evaluation in NLP currently faces, including aspects of experimental design, reporting standards, meta-evaluation and reproducibility. We invite papers on topics including, but not limited to, the following:


Experimental design for human evaluations


Reproducibility of human evaluations


Ethical considerations in human evaluation of computational systems


Quality assurance for human evaluation


Crowdsourcing for human evaluation


Issues in meta-evaluation of automatic metrics by correlation with human evaluations



Alternative forms of meta-evaluation and validation of human evaluations


Comparability of different human evaluations


Methods for assessing the quality of human evaluations


Methods for assessing the reliability of human evaluations


Work on measuring inter-evaluator and intra-evaluator agreement


Frameworks, model cards and checklists for human evaluation


Explorations of the role of human evaluation in the context of Responsible AI and Accountable AI


Protocols for human evaluation experiments in NLP

We welcome work on the above topics and more from any subfield of NLP (and ML/AI more generally), with a particular focus on evaluation of systems that produce language as output. We explicitly encourage the submission of work on both intrinsic and extrinsic evaluation.

Paper Submission Information

Long Papers:

Long papers must describe substantial, original, completed and unpublished work. Wherever appropriate, concrete evaluation and analysis should be included.

Long papers may consist of up to eight (8) pages of content, plus unlimited pages of references. Final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers' comments can be taken into account.

Long papers will be presented orally or as posters as determined by the programme committee. Decisions as to which papers will be presented orally and which as posters will be based on the nature rather than the quality of the work. There will be no distinction in the proceedings between long papers presented orally and as posters.

Short Papers:

Short paper submissions must describe original and unpublished work. Short papers should have a point that can be made in a few pages. Examples of short papers are a focused contribution, a negative result, an opinion piece, an interesting application nugget, or a small set of interesting results.

Short papers may consist of up to four (4) pages of content, plus unlimited pages of references. Final versions of short papers will be given one additional page of content (up to 5 pages) so that reviewers' comments can be taken into account.

Short papers will be presented orally or as posters as determined by the programme committee. While short papers will be distinguished from long papers in the proceedings, there will be no distinction in the proceedings between short papers presented orally and as posters.

Review forms will be made available prior to the deadlines. For more information on applicable policies, see the ACL Policies for Submission, Review, and Citation <https://www.aclweb.org/adminwiki/index.php?title=ACL_Policies_for_Submission,_Review_and_Citation> .

Multiple Submission Policy

HumEval’21 allows multiple submissions. However, if a submission has already been, or is planned to be, submitted to another event, this must be clearly stated in the submission.

Ethics Policy

Authors are required to honour the ethical code set out in the ACL Code of Ethics <https://www.aclweb.org/portal/content/acl-code-ethics>.

The ethical impact of our research, our use of data, and the potential applications of our work have always been important considerations, and as artificial intelligence becomes more mainstream, these issues are increasingly pertinent. We ask that all authors read the code and ensure that their work conforms to it. Where a paper may raise ethical issues, we ask that you include in the paper an explicit discussion of these issues, which will be taken into account in the review process. We reserve the right to reject papers on ethical grounds, where the authors are judged to have operated counter to the ACL Code of Ethics, or to have inadequately addressed legitimate ethical concerns with their work.

Paper Submission and Templates

Submission is electronic, using the Softconf START conference management system. For electronic submission of all papers, please use: https://www.softconf.com/eacl2021/HumEval2021. Both long and short papers must follow the ACL Author Guidelines <https://www.aclweb.org/adminwiki/index.php?title=ACL_Author_Guidelines>, and must use the EACL’21 templates. You can find the EACL-2021 LaTeX template here <https://www.overleaf.com/latex/templates/eacl-2021-proceedings-template/jprrhhtnbrrm> or download the zip file <https://2021.eacl.org/downloads/eacl2021-templates.zip>.


Organisers

Anya Belz, University of Brighton, UK

Shubham Agarwal, Heriot Watt University, UK

Yvette Graham, Trinity College Dublin, Ireland

Ehud Reiter, University of Aberdeen, UK

Anastasia Shimorina, Université de Lorraine / LORIA, France

PC Members

Mohit Bansal, UNC Chapel Hill, US

Saad Mahamood, Trivago, DE

Kevin B. Cohen, University of Colorado, US

Nitika Mathur, University of Melbourne, Australia

Kees van Deemter, Utrecht University, NL

Margot Mieskes, UAS Darmstadt, DE

Ondřej Dušek, Charles University, Czechia

Emiel van Miltenburg, Tilburg University, NL

Karën Fort, Sorbonne University, France

Margaret Mitchell, Google, US

Anette Frank, University of Heidelberg, DE

Mathias Müller, University of Zurich, CH

Claire Gardent, CNRS/LORIA Nancy, France

Malvina Nissim, Groningen University, NL

Albert Gatt, Malta University, Malta

Juri Opitz, University of Heidelberg, DE

Dimitra Gkatzia, Edinburgh Napier University, UK

Ramakanth Pasunuru, UNC Chapel Hill, US

Helen Hastie, Heriot-Watt University, UK

Maxime Peyrard, EPFL, CH

David Howcroft, Heriot Watt University, UK

Inioluwa Deborah Raji, AI Now Institute, US

Jackie Chi Kit Cheung, McGill University, Canada

Verena Rieser, Heriot Watt University, UK

Samuel Läubli, University of Zurich, CH

Samira Shaikh, UNC, US

Chris van der Lee, Tilburg University, NL

Lucia Specia, UCL, UK

Nelson Liu, University of Washington, US

Wei Zhao, TU Darmstadt, DE

Qun Liu, Huawei Noah’s Ark Lab, China

Contact Information

humeval.ws at gmail.com

https://humeval.github.io
