[Corpora-List] CFP: ACL-IJCNLP-2009 Workshop on Comparable Corpora

Pierre Zweigenbaum pz
Wed Apr 22 15:45:05 CEST 2009


Second call for papers

2nd Workshop on Building and Using Comparable Corpora:

from parallel to non-parallel corpora

ACL-IJCNLP 2009

=======================================================================


August 6th, 2009

Suntec, Singapore


Deadline for submission: May 1st, 2009

=======================================================================


Following the success of the first Workshop on Building and Using

Comparable Corpora

<http://www.limsi.fr/~pz/lrec2008-comparable-corpora/> at LREC 2008,

this workshop aims to bring together language engineers as well as

linguists interested in the constitution and use of comparable

corpora, ranging from parallel to non-parallel corpora. In the larger

context of the joint ACL-IJCNLP, this workshop aims to solicit

contributions from researchers in different geographical regions, in

order to highlight in particular the issues with comparable corpora

across languages that are very different from each other, such as

across Asian and European languages. Research in minority languages

is also of particular interest.


Research in comparable corpora has been motivated by two main reasons

in the language engineering and the linguistics communities. In

language engineering, it is chiefly motivated by the need to use

comparable corpora as training data for statistical NLP applications

such as statistical machine translation or cross-lingual retrieval.

In linguistics, on the other hand, comparable corpora are of interest

in themselves for providing intra-linguistic discoveries and

comparisons. It is generally accepted in both communities that

comparable corpora are documents in one to many languages that are

comparable in content and form to various degrees and dimensions. It

was pointed out that parallel corpora are at one end of the spectrum

of comparability whereas quasi-comparable corpora are at the other

end. We believe that the linguistic definitions and observations in

comparable corpora can improve methods to mine such corpora for

applications to statistical NLP. As such, it is of great interest to

bring together builders and users of such corpora.

Parallel corpora are a key resource as training data for statistical

machine translation, and for building or extending bilingual lexicons

and terminologies. However, beyond a few language pairs such as

English-French or English-Chinese and a few contexts such as

parliamentary debates or legal texts, they remain a scarce resource,

despite the creation of automated methods to collect parallel corpora

from the Web. Interest in non-parallel forms of comparable corpora

in language engineering primarily ensued from the scarcity of

parallel corpora. This has motivated research into the use of

comparable corpora: pairs of monolingual corpora selected according

to the same set of criteria, but in different languages or language

varieties. Non-parallel yet comparable corpora overcome the two

limitations of parallel corpora, since sources for original,

monolingual texts are much more abundant than translated

texts. However, because of their nature, mining translations in

comparable corpora is much more challenging than in parallel

corpora. What constitutes a good comparable corpus, for a given task

or per se, also requires specific attention: while the definition of

a parallel corpus is fairly straightforward, building a non-parallel

corpus requires control over the selection of source texts in both


With the advent of online data, the potential for building and

exploring comparable corpora is growing exponentially. Comparable

documents in languages that are very different from each other pose

special challenges since, very often, the non-parallel-ness in sentences

can result from cultural and political differences.


Kenneth Ward Church (Microsoft Research, Redmond)


We solicit contributions on topics including, but not limited to, the following:

* Building Comparable Corpora

- Human translations

- Automatic and semi-automatic methods

- Methods to mine parallel and non-parallel corpora from the Web

- Tools and criteria to evaluate the comparability of corpora

- Parallel vs non-parallel corpora, monolingual corpora

- Rare and minority languages

- Across language families

- Multi-media/multi-modal comparable corpora

* Applications of Comparable Corpora

- Human translations

- Language learning

- Cross-language information retrieval & document categorization

- Bilingual projections

- Machine translation

- Writing assistance

* Mining from Comparable Corpora

- Extraction of parallel segments or paraphrases from

comparable corpora

- Extraction of bilingual and multilingual translations of

single words and multi-word expressions; proper names, named

entities, etc.


May 1, 2009 Paper submissions

Jun 1, 2009 Notification of acceptance

Jun 7, 2009 Camera-ready copies due

Aug 6, 2009 Workshop date


Please use the official style files for ACL/IJCNLP 2009 available at:




Pascale Fung, Hong Kong University of Science & Technology (HKUST)

Pierre Zweigenbaum, LIMSI-CNRS (France)

Reinhard Rapp, University of Mainz (Germany)

and University of Tarragona (Spain)


Hamdulla Askar (Xinjiang University, China)

Srinivas Bangalore (AT&T Labs, US)

Lynne Bowker (University of Ottawa, Canada)

Éric Gaussier (Université Joseph Fourier, Grenoble, France)

Gregory Grefenstette (Exalead, Paris, France)

Hitoshi Isahara (National Institute of Information and Communications

Technology, Japan)

Min-Yen Kan (National University of Singapore)

Adam Kilgarriff (Lexical Computing Ltd)

Philippe Langlais (Université de Montréal, Canada)

Rada Mihalcea (University of North Texas, US)

Dragos Stefan Munteanu (Language Weaver, Inc., US)

Grace Ngai (Hong Kong Polytechnic University, Hong Kong)

Carole Peters (ISTI-CNR, Pisa, Italy)

Serge Sharoff (University of Leeds, UK)

Richard Sproat (OGI School of Science & Technology, US)

Mandel Shi (Xiamen University, China)

Yujie Zhang (National Institute of Information and Communications

Technology, Japan)


Ricky Chan Ho Yin, Hong Kong University of Science & Technology
