[Corpora-List] Call for Participation // Data Statements for NLP --- virtual event online

Emily M. Bender ebender at uw.edu
Sat Apr 18 01:04:35 CEST 2020

The following event has been moved online, 11-13 May 2020, 2-4pm GMT. There is no registration fee. If you are interested in participating, please fill out this survey:

- https://catalyst.uw.edu/webq/survey/ebender/388116


Data Statements for NLP: Towards Best Practices
11-13 May 2020
Call for Participation


We invite participants who are currently developing NLP datasets to join us for a one-day working meeting at LREC 2020 to develop data statements for their datasets and to refine best practices for data statement creation. In this open collaboration session, participants will develop data statements (Bender & Friedman 2018) for specific datasets and, in the process, refine a set of best practices for creating them. Specifically, workshop participants will: (1) be introduced to the concept, structure, and uses of data statements; (2) draft a data statement for the dataset(s) they bring to the workshop; (3) work in small groups to critique and refine their data statements; and (4) reflect on best practices for writing and disseminating data statements.

This event will be organized differently from typical workshops. It is an open collaboration session providing a structured opportunity for a diverse range of participants in our community to help shape and codify best practices. The deliverables from this workshop will be (a) data statements for each participant's dataset and (b) a preliminary best practices document. These will be disseminated online, together with the overview materials provided by the workshop organizers, with the data statements serving as examples of the results of following the preliminary best practices.

There will be no reviewing process ahead of this workshop, nor any proceedings. All participants are welcome, and we especially encourage attendance by people who are currently developing datasets for NLP.

We will work towards best practices for creating data statements, exploring questions like the following:

* How can the information required be efficiently collected?
* What steps can be taken in the planning for a dataset to facilitate the collection of relevant metadata about speakers and annotators?
* What heuristics are there for writing data statements that are concise and informative?
* How can we incorporate material from institutional review board/ethics committee applications into the data statement schema?
* How can we best settle on an appropriate level of detail given privacy concerns, especially for small or vulnerable populations?
* How can we produce data statements for older datasets that predate this practice?
* Finally, how can data statements be incorporated into metadata already associated with data sets, such as is called for by the CLARIN or META-SHARE schemas?

To ensure that the best practices developed are as broadly applicable as possible, we especially encourage participation from developers of datasets for low-resource languages and/or dataset developers from countries not well represented at major NLP conferences.


Emily M. Bender, University of Washington, Department of Linguistics
Batya Friedman, University of Washington, Information School
Angelina McMillan-Major, University of Washington, Department of Linguistics


We thank the Tech Policy Lab at the University of Washington for its support of this workshop.

--
Emily M. Bender (she/her)
Howard and Frances Nostrand Endowed Professor
Department of Linguistics
Faculty Director, CLMS
University of Washington
Twitter: @emilymbender
