[Corpora-List] Summary: Corpus of translated material

Nomi Guthmann nomi.guthmann at googlemail.com
Thu Mar 8 13:48:00 CET 2007

Dear corpora list members,

Here is a summary of the responses I received about corpora of translated
material (the main requirement was that the source language of each
translation be known):

The EUROPARL corpus
In its current form, it does not include information on the source
language of the various texts, but I was told that its next release

The English-Estonian and Estonian-English parallel corpus
It includes Estonian laws and EU legislation, and their translations.

The INTERSECT corpus
It includes English-French and English-German translations in several domains.

The COMPARA corpus
It includes English and Portuguese bi-directional parallel texts.

The OPUS corpus
It is an open source parallel corpus in several languages.
Jörg Tiedemann also has a corpus of aligned movie subtitles, available
for research purposes only.

The TEC corpus
A large corpus of translated English.

The Bible corpus

Corina Forascu has a section of the TimeBank 1.2 (English) corpus
translated into Romanian.

JRC-Acquis multilingual parallel corpus
A parallel corpus in several languages. The source languages in this
corpus are unknown.

The CroCo project
Corpus of German and English translations. The corpus is not available
for copyright reasons.

Many thanks for their responses to:
Chris Callison-Burch
Israel Cohen
Corina Forascu
Ana Frankenberg-Garcia
Hieu Hoang
Heiki Kaalep
Andrea Mulloni
Stella Neumann
Sebastian Padó
Raphael Salkie
Armin Schmidt
Harold Somers
Ralf Steinberger
Jörg Tiedemann

Noemie Guthmann
Translation and Interpreting Studies Department
Bar Ilan University
