[Corpora-List] BUCC Shared Task: Bilingual term alignment in comparable corpora

Pierre Zweigenbaum pz at lisn.fr
Thu Mar 10 09:55:40 CET 2022

BUCC 2022 SHARED TASK: bilingual term alignment in comparable specialized corpora

https://comparable.limsi.fr/bucc2022/bucc2022-task.html

23 Mar 2022  Deadline for submission of system runs
25 Jun 2022  Workshop at LREC 2022, Marseille, France

Through the 2022 BUCC shared task, we seek to evaluate methods that detect pairs of terms that are translations of each other in two comparable corpora, with an emphasis on multi-word terms in specialized domains.


Given a test dataset with comparable corpora C1 and C2, and lists of terms D1 and D2, participant systems are expected to produce an ordered list of term pairs in (D1, D2) that are translations of each other, in descending order of confidence.
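As a rough illustration of the expected output, the sketch below scores every candidate pair in D1 x D2 and sorts the pairs by descending confidence. The names `toy_score` and `rank_term_pairs` are hypothetical, and the string-similarity score is only a placeholder assumption: a real system would presumably derive its confidence from the comparable corpora C1 and C2. The submission file format is not covered here; see the shared task page.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple


def toy_score(src: str, tgt: str) -> float:
    """Placeholder confidence: surface similarity between the two strings.

    This is an assumption for illustration only; it ignores the corpora.
    """
    return SequenceMatcher(None, src.lower(), tgt.lower()).ratio()


def rank_term_pairs(
    d1: Sequence[str],
    d2: Sequence[str],
    score: Callable[[str, str], float] = toy_score,
    top_k: Optional[int] = None,
) -> List[Tuple[str, str, float]]:
    """Return (source term, target term, confidence) sorted by descending confidence."""
    scored = [(s, t, score(s, t)) for s, t in product(d1, d2)]
    # sorted() is stable, so ties keep their enumeration order.
    ranked = sorted(scored, key=lambda item: item[2], reverse=True)
    return ranked[:top_k]  # top_k=None keeps the full list


if __name__ == "__main__":
    for src, tgt, conf in rank_term_pairs(["energy"], ["energie", "vent"]):
        print(f"{src}\t{tgt}\t{conf:.3f}")
```

Scoring the full Cartesian product is quadratic in the size of the term lists, so larger lists would likely need candidate pruning before ranking.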

Sample and training datasets are provided on the shared task page. When reporting their results, participants are required to specify which resources they used. They are also encouraged to test conditions in which they only use the provided resources.

The evaluation metric will be the Average Precision of the predicted bilingual term pair list, where the relevance of a term pair is determined by its presence in the (hidden) gold standard dictionary D1,2.
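For participants who want to score their own runs on the training data, here is a minimal sketch of Average Precision over a ranked pair list. It assumes the common definition (precision at each rank where a gold pair is found, summed and divided by the number of gold pairs) and that duplicate predictions are counted once; the official scorer may differ in such details. The function name `average_precision` is hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def average_precision(ranked_pairs: Iterable[Pair], gold: Set[Pair]) -> float:
    """Average Precision of a ranked list of term pairs against a gold set.

    Assumption: normalization is by the size of the gold set, so gold pairs
    that are never predicted lower the score.
    """
    if not gold:
        return 0.0
    # Drop repeated predictions while keeping the first (best-ranked) occurrence.
    unique_pairs = list(dict.fromkeys(ranked_pairs))
    hits = 0
    precision_sum = 0.0
    for rank, pair in enumerate(unique_pairs, start=1):
        if pair in gold:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(gold)


if __name__ == "__main__":
    gold = {("wind turbine", "éolienne"), ("energy", "énergie")}
    ranked = [
        ("wind turbine", "éolienne"),  # correct at rank 1
        ("energy", "éolienne"),        # incorrect at rank 2
        ("energy", "énergie"),         # correct at rank 3
    ]
    print(average_precision(ranked, gold))
```

In this toy example the two correct pairs are found at ranks 1 and 3, giving (1/1 + 2/3) / 2, i.e. about 0.83.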


13 Feb 2022  Training data release (done)
16 Mar 2022  Test data release
23 Mar 2022  Submission of system runs by participants
30 Mar 2022  Evaluation sent to participants
10 Apr 2022  Submission of shared task papers to the BUCC workshop
25 Jun 2022  Workshop


The BUCC 2022 shared task is on multilingual terminology alignment in comparable corpora. Many research groups are working on this problem using a wide variety of approaches. However, as there is no standard way to measure the performance of the systems, the published results are not comparable and the pros and cons of the various approaches are not clear. The shared task aims at solving these problems by organizing a fair comparison of systems. This is accomplished by providing corpora and evaluation datasets for a number of language pairs and domains.

Moreover, the importance of dealing with multi-word expressions in Natural Language Processing applications has been recognized for a long time. In particular, multi-word expressions pose serious challenges for machine translation systems because of their syntactic and semantic properties. Furthermore, multi-word expressions tend to be more frequent in domain-specific text, hence the need to handle them in tasks with specialized-domain corpora.

For further details see the shared task website at https://comparable.limsi.fr/bucc2022/bucc2022-task.html


Omar Adjali (Université Paris-Saclay, CNRS, LISN, Orsay, France)
Emmanuel Morin (Nantes Université, LS2N, Nantes, France)
Serge Sharoff (University of Leeds, United Kingdom)
Reinhard Rapp (Athena R.C., Greece; Magdeburg-Stendal University of Applied Sciences and University of Mainz, Germany)
Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)

Shared task contact points: please send expressions of interest to:

omar (dot) adjali (at) universite-paris-saclay (dot) fr

CC emmanuel (dot) morin (at) ls2n (dot) fr

CC pz (at) lisn (dot) fr
