We are pleased to announce the release of a new dataset specifically developed for countering online hate speech. The dataset contains 5000 hate speech/counter narrative pairs, each annotated with the corresponding hate target.
The dataset was collected using a human-in-the-loop strategy: we used a generative language model (GPT-2) to generate candidate data, which experts then validated and post-edited. We then iteratively fed the reviewed data back to the LM to refine it and obtain new data.
- The resource is freely available for research purposes at the following link: https://github.com/marcoguerini/CONAN/#Multitarget-CONAN
- The dataset and the methodology are thoroughly described in: Fanton, M., Bonaldi, H., Tekiroğlu, S. S., & Guerini, M. (2021). "Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech". In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 3226-3240).
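
For readers who want a quick picture of the human-in-the-loop strategy described above, here is a minimal Python sketch of the loop. It is only an illustration under stated assumptions, not the code used to build the dataset: the names `human_in_the_loop`, `toy_generate` and `toy_review` are hypothetical, the generator is a stand-in for GPT-2 refined on the current data, and the reviewer is a stand-in for the expert validation and post-editing step. Please refer to the paper for the actual procedure.

```python
from typing import Callable, List, Optional, Tuple

# A (hate speech, counter narrative) pair.
Pair = Tuple[str, str]


def human_in_the_loop(
    seed: List[Pair],
    generate: Callable[[List[Pair], int], List[Pair]],
    review: Callable[[Pair], Optional[Pair]],
    rounds: int,
    per_round: int,
) -> List[Pair]:
    """Grow a dataset by alternating generation and expert review.

    generate: hypothetical stand-in for a language model refined on `data`.
    review:   hypothetical stand-in for an expert who returns a post-edited
              pair, or None to reject the candidate.
    """
    data = list(seed)
    for _ in range(rounds):
        # The generator sees all data accepted so far.
        candidates = generate(data, per_round)
        for candidate in candidates:
            edited = review(candidate)
            if edited is not None and edited not in data:
                data.append(edited)
    return data


def toy_generate(data: List[Pair], n: int) -> List[Pair]:
    # Placeholder generator: emits numbered dummy pairs instead of LM output.
    start = len(data)
    return [(f"HS {start + i}", f"cn {start + i}") for i in range(n)]


def toy_review(pair: Pair) -> Optional[Pair]:
    # Placeholder reviewer: rejects every third candidate, post-edits the rest.
    index = int(pair[0].split()[1])
    if index % 3 == 0:
        return None
    return (pair[0], pair[1].capitalize() + ".")


if __name__ == "__main__":
    seed_pairs = [("HS 0", "CN 0.")]
    dataset = human_in_the_loop(seed_pairs, toy_generate, toy_review,
                                rounds=2, per_round=3)
    for hs, cn in dataset:
        print(hs, "->", cn)
```

The generator and the reviewer are passed in as plain callables so that either side can be swapped out (a different model, a different review policy) without touching the loop itself.
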

Best regards,
Marco Guerini
-- Marco Guerini, PhD www.marcoguerini.eu