[Corpora-List] CFP: 8th BUCC Workshop at ACL and Shared Task on Identifying Comparable Texts

Pierre Zweigenbaum pz at limsi.fr
Tue Feb 10 16:03:01 CET 2015


Co-located with ACL 2015, Beijing (China), 30 July 2015

Deadline for papers: 8 May 2015
Website: http://comparable.limsi.fr/bucc2015/

SHARED TASK: Identification of comparable texts (see below)
Deadline for runs: 24 April 2015
Website: http://comparable.limsi.fr/bucc2015/bucc2015-task.html


8 May 2015 Deadline for submission of full papers

4 June 2015 Notification of acceptance

21 June 2015 Camera-ready papers due

30 July 2015 Workshop date


In the language engineering and linguistics communities, research on comparable corpora has been motivated by two main factors. In language engineering, the chief motivation is the need for comparable corpora as training data for statistical NLP applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves, as they make intra-linguistic discoveries and comparisons possible. It is generally accepted in both communities that comparable corpora consist of documents, in one or several languages, that are comparable in content and form to varying degrees and along various dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods for mining such corpora for statistical NLP applications. It is therefore of great interest to bring together builders and users of such corpora.


We solicit contributions including but not limited to the following topics.

Building Comparable Corpora:
• Human translations
• Automatic and semi-automatic methods
• Methods to mine parallel and non-parallel corpora from the Web
• Tools and criteria to evaluate the comparability of corpora
• Parallel vs non-parallel corpora, monolingual corpora
• Rare and minority languages, across language families
• Multi-media/multi-modal comparable corpora

Applications of comparable corpora:
• Human translations
• Language learning
• Cross-language information retrieval & document categorization
• Bilingual projections
• Machine translation
• Writing assistance
• Machine learning techniques using comparable corpora

Mining from Comparable Corpora:
• Induction of morphological, grammatical, and translation rules from comparable corpora
• Extraction of parallel segments or paraphrases from comparable corpora
• Extraction of bilingual and multilingual translations of single words and multi-word expressions, proper names, and named entities from comparable corpora
• Induction of multilingual word classes from comparable corpora
• Cross-language distributional semantics

Submission Information

See BUCC 2015 website: http://comparable.limsi.fr/bucc2015/bucc2015-cfp.html


Pierre Zweigenbaum LIMSI, CNRS, Orsay (France), Chair

Serge Sharoff University of Leeds (UK), Shared Task Chair

Reinhard Rapp University of Mainz (Germany)


Ahmet Aker (University of Sheffield, UK)
Srinivas Bangalore (AT&T Labs, US)
Caroline Barrière (CRIM, Montréal, Canada)
Hervé Déjean (Xerox Research Centre Europe, Grenoble, France)
Kurt Eberle (Lingenio, Heidelberg, Germany)
Andreas Eisele (European Commission, Luxembourg)
Éric Gaussier (Université Joseph Fourier, Grenoble, France)
Gregory Grefenstette (INRIA, Saclay, France)
Silvia Hansen-Schirra (University of Mainz, Germany)
Hitoshi Isahara (Toyohashi University of Technology)
Kyo Kageura (University of Tokyo, Japan)
Adam Kilgarriff (Lexical Computing Ltd, UK)
Natalie Kübler (Université Paris Diderot, France)
Philippe Langlais (Université de Montréal, Canada)
Michael Mohler (Language Computer Corp., US)
Emmanuel Morin (Université de Nantes, France)
Dragos Stefan Munteanu (Language Weaver, Inc., US)
Lene Offersgaard (University of Copenhagen, Denmark)
Ted Pedersen (University of Minnesota, Duluth, US)
Reinhard Rapp (Université Aix-Marseille, France)
Sujith Ravi (Google, US)
Serge Sharoff (University of Leeds, UK)
Michel Simard (National Research Council Canada)
Tim Van de Cruys (IRIT-CNRS, Toulouse, France)
Stephan Vogel (QCRI, Qatar)
Guillaume Wisniewski (Université Paris Sud & LIMSI-CNRS, Orsay, France)
Pierre Zweigenbaum (LIMSI-CNRS, Orsay, France)

================ SHARED TASK

A shared task is organized together with the workshop. This will be the first evaluation exercise on the identification of comparable texts: given a large multilingual collection of texts (we will be using Wikipedia documents in several languages), the task is to identify the most similar texts across languages. Evaluation will be done by measuring precision, recall and F-measure on links between pages, against a gold standard based on actual inter-language links.
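The official scoring will use the TREC evaluation script (see below), but the measures themselves can be illustrated with a minimal sketch that treats the gold-standard and predicted inter-language links as sets of (source page, target page) ID pairs; all IDs and the function name here are illustrative, not taken from the actual data:

```python
def link_prf(gold, predicted):
    """Precision, recall and F1 over cross-language page links.

    gold, predicted: sets of (source_id, target_id) pairs.
    """
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    tp = len(gold & predicted)           # correctly predicted links
    precision = tp / len(predicted)      # share of predictions that are correct
    recall = tp / len(gold)              # share of gold links that were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy example: 3 gold links, 2 predictions, 1 correct.
gold = {("fr1", "en1"), ("fr2", "en2"), ("fr3", "en3")}
pred = {("fr1", "en1"), ("fr2", "en9")}
p, r, f = link_prf(gold, pred)
# p = 0.5, r = 1/3, f = 0.4
```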

Task description

Parallel corpora of original texts and their translations have provided the basis for multilingual NLP applications since the beginning of the 1990s. The relative scarcity of such resources has led to greater attention to comparable (i.e., less parallel) resources for mining information about possible translations. Many studies have been produced within the comparable-corpora paradigm, including publications in the BUCC workshop series since 2008; see bucc-introduction.html.

However, the community has so far not conducted an evaluation comparing different approaches to identifying more or less parallel resources in a large amount of multilingual data. It is also not clear how language-specific such approaches are. In this shared task we propose the first such evaluation exercise, aimed at detecting the most similar texts in a large collection.

Data set

The data for each language pair has been split into two sets:

* pages with information about the correct links for the respective language pairs;

* pages without the links.

The task is, for each page in the test set, to submit up to five ranked suggestions for its linked page, assuming that the gold standard contains its counterpart in another language. Submissions must be in the tab-separated format used in TREC submissions, with six fields:

id1 X id2 Y score run.name

The X and Y fields are not used, but they are reserved by the TREC evaluation script (which does not use them either); please keep them set to the constant values X and Y. id1 and id2 are the article IDs in the language being evaluated and in English, respectively. The score should reflect the similarity between id1 and id2: the higher the score, the closer the two articles. Participants are invited to submit up to five runs of their system with different parameters, identified by a keyword in the last field. This field should include the name of the team and an identifier for the run, e.g., Leeds.run1 or LIMSI.BM25. For the evaluation script and more information about the format, please visit: http://trec.nist.gov/trec_eval/
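As a minimal sketch of producing a run file in this six-field format (the article IDs, scores, file name and helper name below are all hypothetical, chosen for illustration only):

```python
def write_run(pairs, run_name, path):
    """Write ranked (id1, id2, score) triples in the six-field
    TREC-style format: id1 <TAB> X <TAB> id2 <TAB> Y <TAB> score <TAB> run.name"""
    with open(path, "w", encoding="utf-8") as out:
        for id1, id2, score in pairs:
            # X and Y are the unused, reserved constant fields.
            out.write(f"{id1}\tX\t{id2}\tY\t{score:.4f}\t{run_name}\n")

# Hypothetical example: two ranked English candidates for one French page,
# for a run labelled with the team name plus a run identifier.
write_run([("fr-12345", "en-67890", 0.93),
           ("fr-12345", "en-11111", 0.41)],
          "Leeds.run1", "leeds.run1.tsv")
```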

The languages in the shared task will be Chinese, French, German, Russian and Turkish. Pages in these languages need to be linked to a page in English.

Submission procedure

Please register by sending a message to shared.bucc2015 at gmail.com, giving the name of the contact person and the language pairs you would like to work on.

In response you will receive links to the training sets and the scoring script.

Task deadlines

1 February 2015 Training set available

20 April 2015 Test set available

24 April 2015 Test submission deadline

1 May 2015 System results to participants

8 May 2015 Paper submission deadline

4 June 2015 Notification of acceptance

21 June 2015 Camera-ready papers due
