[Corpora-List] CFP: Building and Using Comparable Corpora at ACL'17, Vancouver, Canada

Serge Sharoff S.Sharoff at leeds.ac.uk
Sun Apr 16 20:20:19 CEST 2017

Dear all,

This is a reminder that the deadline for submitting papers to the Building and Using Comparable Corpora Workshop is this coming Friday, 21 April. Full information is given below.

For participants in the shared task, the test data will also be released on Friday, 21 April, and the deadline for submitting results is 28 April.

Best wishes, Serge


10th Workshop on Building and Using Comparable Corpora
Shared task: detection of parallel sentences in Comparable Corpora

Important dates
Workshop Submission deadline: 21 April, 2017
Workshop Notification: 19 May, 2017
Workshop Camera Ready: 26 May, 2017

Website: https://comparable.limsi.fr/bucc2017/

*Shared task: Identifying parallel sentences in comparable corpora*

We announce a new shared task for 2017. As is well known, a bottleneck in statistical machine translation is the scarcity of parallel resources for many language pairs and domains. Previous research has shown that this bottleneck can be reduced by utilizing parallel portions found within comparable corpora. Such parallel portions are useful for many purposes, including automatic terminology extraction and the training of statistical MT systems.

The aim of the shared task is to quantitatively evaluate competing methods for extracting parallel sentences from comparable monolingual corpora, so as to give an overview of the state of the art and to identify the best-performing approaches.
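As an illustration of what such a quantitative evaluation might look like, here is a minimal sketch. It assumes that systems are scored by precision, recall and F1 over extracted sentence pairs compared against a gold-standard set; this announcement does not specify the measure, so please check the task description on the website for the official one. The function name score_pairs and the sentence identifiers are hypothetical.

```python
from typing import Iterable, Tuple

# A sentence pair is identified by (source id, target id).
# The id format used below is hypothetical, for illustration only.
Pair = Tuple[str, str]


def score_pairs(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted pairs against gold pairs.

    Assumption: a predicted pair counts as correct only if it matches a
    gold pair exactly. Duplicate predictions are counted once.
    """
    pred_set = set(predicted)
    gold_set = set(gold)
    true_pos = len(pred_set & gold_set)

    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: four gold pairs, three system predictions, two of them correct.
gold = {("fr-1", "en-7"), ("fr-2", "en-3"), ("fr-5", "en-9"), ("fr-8", "en-4")}
predicted = [("fr-1", "en-7"), ("fr-2", "en-3"), ("fr-6", "en-1")]

precision, recall, f1 = score_pairs(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Sets are used so that a pair reported twice by a system is not rewarded twice; whether the official scorer behaves this way is likewise an assumption.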

Shared task sample set release: 6 February, 2017
Shared task training set release: 13 February, 2017
Shared task test set release: 21 April, 2017
Shared task test submission deadline: 28 April, 2017
Shared task camera ready papers: 26 May, 2017

Any submission to the shared task is expected to be accompanied by a short paper (4 pages plus references). The paper will be accepted for publication in the workshop proceedings automatically, although it must be submitted via Softconf and will go through the standard peer-review process.


In the language engineering and linguistics communities, research in comparable corpora has had two main motivations. In language engineering, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves because they make possible intra-linguistic discoveries and comparisons. It is generally accepted in both communities that comparable corpora are documents in one or several languages that are comparable in content and form to various degrees and along various dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP. As such, it is of great interest to bring together builders and users of such corpora.


We solicit contributions including but not limited to the following topics.

Building Comparable Corpora:
• Human translations
• Automatic and semi-automatic methods
• Methods to mine parallel and non-parallel corpora from the Web
• Tools and criteria to evaluate the comparability of corpora
• Parallel vs non-parallel corpora, monolingual corpora
• Rare and minority languages, across language families
• Multi-media/multi-modal comparable corpora

Applications of comparable corpora:
• Human translations
• Language learning
• Cross-language information retrieval & document categorization
• Bilingual projections
• Machine translation
• Writing assistance
• Machine learning techniques using comparable corpora

Mining from Comparable Corpora:
• Induction of morphological, grammatical, and translation rules from comparable corpora
• Extraction of parallel segments or paraphrases from comparable corpora
• Extraction of bilingual and multilingual translations of single words and multi-word expressions, proper names, and named entities from comparable corpora
• Induction of multilingual word classes from comparable corpora
• Cross-language distributional semantics

Submission Information

See BUCC 2017 website: http://comparable.limsi.fr/bucc2017/

Workshop organisers:

Serge Sharoff (University of Leeds, UK), Chair
Pierre Zweigenbaum (LIMSI-CNRS, Orsay, France), Shared task organiser
Reinhard Rapp (University of Mainz, Germany)
