[Corpora-List] EAC-TM - Another freely available translation memory, in 26 languages

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Wed Feb 6 01:25:47 CET 2013

EAC-TM is a translation memory (sentences and their manually produced translations) in 26 languages. It is a multilingual parallel corpus covering 325 language pairs.

Size: Up to 5,100 translation units per language; 78,000 in total.

Languages: All 325 language pairs involving the following 26 languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian,
Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish and Turkish.
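
The figure of 325 is simply the number of unordered pairs that can be formed from 26 languages (26 x 25 / 2). A minimal Python sketch to check this against the list above; the variable names are illustrative only and are not part of the EAC-TM distribution:

```python
from itertools import combinations

# The 26 languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian", "German",
    "Greek", "Finnish", "French", "Croatian", "Hungarian", "Icelandic", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Norwegian", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish", "Turkish",
]

# Every unordered pair of distinct languages counts as one language pair.
language_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages,", len(language_pairs), "language pairs")
```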

URL: http://langtech.jrc.ec.europa.eu/EAC-TM.html

Creator: EC Directorate for Education and Culture (EAC) <http://ec.europa.eu/dgs/education_culture/> and JRC


EAC-TM was produced by translating the English-language form data for the Lifelong Learning Programme (LLP) and the Youth in Action Programme of the European Commission’s Directorate General for Education and Culture (DG EAC). The results of the translation were stored in 25 bilingual translation memories, each pairing English with one other language. DG EAC and the JRC post-processed these by cleaning the data and by producing a single alignment across all 26 languages, resulting in parallel data for 325 language pairs.
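
The announcement does not describe how the single alignment was produced. One plausible reading is that the 25 bilingual memories were joined on their shared English side. The Python sketch below illustrates that idea under this assumption; the function names, the dictionary layout and the toy form labels are all hypothetical and do not come from the EAC-TM data:

```python
def merge_on_english(bilingual_tms):
    """Join bilingual memories on their English source segment.

    bilingual_tms maps a language code to {english_segment: translation}.
    Returns {english_segment: {language_code: text}}, including "en" itself.
    """
    merged = {}
    for lang, tm in bilingual_tms.items():
        for english, translation in tm.items():
            unit = merged.setdefault(english, {"en": english})
            unit[lang] = translation
    return merged


def extract_pair(merged, lang_a, lang_b):
    """Return aligned (lang_a, lang_b) segments for units that contain both."""
    return [
        (unit[lang_a], unit[lang_b])
        for unit in merged.values()
        if lang_a in unit and lang_b in unit
    ]


# Toy data, invented for illustration only.
toy_tms = {
    "de": {"First name": "Vorname", "Date of birth": "Geburtsdatum"},
    "fr": {"First name": "Prénom"},
}

merged = merge_on_english(toy_tms)
german_french = extract_pair(merged, "de", "fr")
english_german = extract_pair(merged, "en", "de")
print(german_french)
print(english_german)
```

In this sketch a unit that is missing from one language simply drops out of the pairs involving that language, which would be consistent with the size being given as "up to" a number of units per language.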

The underlying documents are thus form data in the field of education and culture.

The EAC Translation Memory <http://langtech.jrc.ec.europa.eu/EAC-TM.html> is much smaller than the other multilingual resources distributed in the past by the European Commission’s Joint Research Centre (JRC). Its main advantages are that (a) it covers even more languages and (b) it is based on texts from a very different domain (education and culture).


The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of the JRC-Acquis <http://langtech.jrc.ec.europa.eu/JRC-Acquis.html> parallel corpus in 2006 (over 1 billion words in 22 languages), the DGT-TM Translation Memory <http://langtech.jrc.ec.europa.eu/DGT-TM.html> in 2007 and 2011, the multilingual named entity resource JRC-Names <http://langtech.jrc.ec.europa.eu/JRC-Names.html> in 2011, the multi-label classification software JRC EuroVoc Indexer JEX <http://langtech.jrc.ec.europa.eu/Eurovoc.html> in 22 languages in 2012, the ECDC-TM Translation Memory <http://ipsc.jrc.ec.europa.eu/?id=782> in 25 languages in 2012, the DGT-Acquis <http://ipsc.jrc.ec.europa.eu/?id=783> parallel corpus in 23 languages in 2012, and further smaller multilingual resources. See http://ipsc.jrc.ec.europa.eu/?id=61 for more information on these resources.


EAC-TM can be fed into translation memory software to support human translators in their work. As it is a large parallel corpus in electronic form, it can furthermore be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.
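
To illustrate what supporting a translator means in practice: a translation memory tool looks up each new source sentence among the stored ones and proposes the stored translation of the closest match. The Python sketch below imitates this with the standard-library difflib module. It is only a rough approximation, as real translation memory software uses its own matching and scoring; the function name, the 0.75 threshold and the toy entries are assumptions made for this example:

```python
import difflib


def suggest_translation(memory, sentence, cutoff=0.75):
    """Return (stored_source, stored_translation) for the closest match, or None.

    memory maps source segments to their translations. cutoff is a similarity
    threshold between 0 and 1; the value 0.75 is arbitrary.
    """
    matches = difflib.get_close_matches(sentence, list(memory), n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    return best, memory[best]


# Toy English-German memory, invented for illustration only.
toy_memory = {
    "Date of birth": "Geburtsdatum",
    "Place of birth": "Geburtsort",
}

near_match = suggest_translation(toy_memory, "Date of birth:")
no_match = suggest_translation(toy_memory, "Budget summary")
print(near_match)
print(no_match)
```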


The JRC and collaborating services of the European Commission hope to release further large-scale linguistic resources in the future.

Ralf Steinberger <http://langtech.jrc.ec.europa.eu/RS.html> & Mohamed Ebrahim
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy

URL – Applications: http://emm.newsbrief.eu/overview.html

URL – Publications on the science behind them: http://langtech.jrc.ec.europa.eu/JRC_Publications.html
