[Corpora-List] DGT-TM - A freely available large-scale translation memory in 22 languages

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Fri Apr 13 15:47:26 CEST 2012

DGT-TM is a translation memory (sentences and their manually produced translations) in 22 languages.

Size: About 3 million sentences for most languages, 57 million in total.

Languages: All 231 language pairs involving the following 22 languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish.
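
The figure of 231 language pairs follows from counting the unordered pairs that can be formed from the 22 languages listed above, i.e. 22 × 21 / 2. A minimal Python sketch of that count (the variable names are illustrative only):

```python
from itertools import combinations

# The 22 languages named in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: direction is ignored, so (English, French) and
# (French, English) count once. This is n * (n - 1) / 2.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), len(pairs))  # 22 231
```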

URL: http://langtech.jrc.ec.europa.eu/DGT-TM.html

Creator: European Commission - Directorate General for Translation (DGT), http://ec.europa.eu/dgs/translation/index_en.htm

The first version of DGT-TM (19 million sentences, or ‘Translation Units’) was released in 2007. With the addition of a further 38 million sentences, the collection has now tripled in size. New data is planned for release annually.


The ‘Acquis Communautaire’ (http://europa.eu/abc/eurojargon/index_en.htm) is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 22 official languages. For the 23rd official EU language, Irish, the Acquis is not translated on a regular basis, which is why DGT-TM does not include data in Irish. The Acquis Communautaire was split into sentences and aligned automatically at sentence level, resulting in the DGT translation memory, DGT-TM. The text data is accompanied by software that extracts all sentences and their translations for any of the 231 possible language pairs.
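
To illustrate what extracting one language pair from a multilingual translation memory involves, here is a minimal Python sketch. It is not the software distributed with DGT-TM. It assumes the data is stored as TMX, a common XML interchange format for translation memories; this announcement does not state the file format. The sample units, the language codes and the function name `extract_pairs` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample: two translation units, the second lacking French.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR"><seg>Article premier</seg></tuv>
      <tuv xml:lang="DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN"><seg>Annex</seg></tuv>
      <tuv xml:lang="DE"><seg>Anhang</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) sentence pairs for one language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Accept both xml:lang (TMX 1.4) and the older plain lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang.upper()] = "".join(seg.itertext())
        # Keep the unit only if both requested languages are present.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


print(extract_pairs(SAMPLE_TMX, "EN", "FR"))
print(extract_pairs(SAMPLE_TMX, "EN", "DE"))
```

Units that lack one of the two requested languages are skipped, so the number of pairs obtained can differ from one language pair to another.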


The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of the JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22 languages), of the DGT-TM Translation Memory in 2007, the multilingual named entity resource JRC-Names in 2011, and further smaller multilingual resources. See http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information on these resources.


DGT-TM can be fed into translation memory software to support human translators in their work. As it is a large parallel corpus in electronic form, it can furthermore be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.


At http://langtech.jrc.ec.europa.eu/ you will find more information on the JRC’s multilingual language technology activity, download links for DGT-TM, and a page pointing to other multilingual resources. For details on DGT-TM, see:

Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter (2012). DGT-TM: A freely Available Translation Memory in 22 Languages. Proceedings of the 8th international conference on Language Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.

http://langtech.jrc.ec.europa.eu/Documents/2012_LREC_DGT-TM_Final.pdf


The JRC and collaborating services of the European Commission plan to release further large-scale linguistic resources in the near future. JEX, the JRC EuroVoc Indexer software, will be released in May 2012. It automatically assigns multiple labels to documents according to the large-scale subject domain classification scheme EuroVoc.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
Applications: http://emm.newsbrief.eu/overview.html
The science behind them: http://langtech.jrc.ec.europa.eu/
