[Corpora-List] Release 2018 of DGT-Translation Memory (free parallel corpus in 24 languages)

Ralf.STEINBERGER at ec.europa.eu
Fri Mar 23 15:57:43 CET 2018


We are happy to announce that the 2018 update release of the DGT-Translation Memory (DGT-TM) is now available for free download.

This year's release adds 6.8 million translation units (~ sentences) - or 122 million words - to the collection.

With this update, a total of 121 million translation units is now available for download, equivalent to over 2 billion words. More data for language pairs involving Maltese are available on request.

DGT-TM is an extraction of the translation memory of the European Institutions for all 24 official EU languages, produced by the European Commission's Directorate General for Translation (DGT) and distributed by the Joint Research Centre (JRC). A translation memory is a collection of sentences paired with their manually produced translations.

The new release is called DGT-TM-2018. It follows the original 2007 release DGT-TM and the yearly updates since 2011.

Languages: All 276 language pairs involving the following 24 languages:

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian,
German, Greek, Finnish, French, Irish, Hungarian, Italian,
Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian,
Slovak, Slovene, Spanish and Swedish.
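The figure of 276 is the number of unordered pairs that can be formed from 24 languages (24 x 23 / 2). A minimal Python sketch to check the count; the language names are copied from the list above, and the variable names are purely illustrative.

```python
from itertools import combinations

# The 24 official EU languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "German", "Greek", "Finnish", "French", "Irish",
    "Hungarian", "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: (English, French) and (French, English) count once.
language_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))       # 24
print(len(language_pairs))  # 276
```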

URL: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Creator: European Commission - Directorate General for Translation (DGT<http://ec.europa.eu/dgs/translation/index_en.htm>)

WHAT IS DGT-TM

The 'Acquis Communautaire<http://europa.eu/abc/eurojargon/index_en.htm>' is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 23 official languages. For the 24th official EU language, Irish, the Acquis has not been translated on a regular basis, which is why DGT-TM includes less data in Irish. The Acquis Communautaire was split into sentences and aligned automatically at sentence level, resulting in the DGT translation memory, DGT-TM. Small parts of the alignment data have been corrected by translators. The text data is accompanied by software that allows extracting all sentences and their translations for any of the 276 possible language pairs.
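To illustrate what extracting one language pair involves, here is a minimal Python sketch. It is not the extraction software that accompanies the data. It assumes the translation units are stored in TMX (Translation Memory eXchange) files, which this announcement does not state, and that each unit carries one `tuv` element per language with a `lang` or `xml:lang` attribute. The function name `extract_pairs` and the sample sentences are hypothetical and not taken from DGT-TM.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper: language codes are compared on their primary
    subtag in lower case, so "EN-GB" matches "en".
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Assumption: the language is given as xml:lang or plain lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                key = lang.split("-")[0].lower()
                segments[key] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


# Made-up sample for illustration only (not real DGT-TM content).
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv lang="FR-FR"><seg>Le présent règlement entre en vigueur.</seg></tuv>
      <tuv lang="DE-DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
    <tu>
      <tuv lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
  </body>
</tmx>"""

for source, target in extract_pairs(SAMPLE_TMX, "en", "fr"):
    print(source, "|", target)
```

Units that lack one of the two requested languages are skipped, so the number of pairs differs from one language pair to another.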

MOTIVATION FOR THIS RELEASE

The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of a number of further multilingual data sets:

· the JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22 languages),

· the DGT-TM Translation Memory in 2007,

· the multilingual named entity resource JRC-Names in 2011 (and its Linked Data version in 2016),

· the multilingual multi-label classification tool (and accompanying text data) JRC EuroVoc Indexer (JEX) (22 languages) in 2012,

· the ECDC-TM Translation Memory in 2012 (domain: Public Health),

· the DGT-Acquis parallel corpus in 2012,

· the EAC-TM Translation Memory in 2013 (domain: Education and Culture),

· the DCEP (Digital Corpus of the European Parliament) in 2014,

· and further smaller multilingual resources.

See https://ec.europa.eu/jrc/en/language-technologies for more information on these resources.

WHAT DGT-TM CAN BE USED FOR

DGT-TM can be fed into translation memory software to support human translators in their work. As it is a large parallel corpus in electronic form, it can furthermore be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.
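As a rough illustration of the first use case, the sketch below shows the kind of fuzzy lookup that translation memory software performs: given a new sentence, it finds the most similar stored source sentence and proposes its translation. This is a toy example using Python's standard `difflib` module, not how any particular translation tool works. The function name `best_match`, the 0.7 threshold and the sample sentences are assumptions made for the example.

```python
from difflib import SequenceMatcher


def best_match(query, memory, threshold=0.7):
    """Return (score, source, target) for the closest stored source, or None.

    Hypothetical helper: similarity is difflib's character-based ratio,
    computed on lower-cased text.
    """
    best = None
    for source, target in memory:
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Made-up translation units for illustration only (not real DGT-TM content).
MEMORY = [
    ("This Regulation shall enter into force.",
     "Le présent règlement entre en vigueur."),
    ("Article 1", "Article premier"),
]

match = best_match("This Directive shall enter into force.", MEMORY)
if match is not None:
    score, source, target = match
    print(f"{score:.2f}  {source!r} -> {target!r}")
```

A linear scan like this would be far too slow for millions of translation units; real tools rely on indexing, which is outside the scope of this sketch.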

MORE INFORMATION ON DGT-TM

Detailed publications on the JRC's multilingual language technology activity are listed at https://wt-public.emm4u.eu/Resources/JRC-EMM_Publications.pdf. For details specifically on DGT-TM, you can read:

Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter (2012).
DGT-TM: A Freely Available Translation Memory in 22 Languages.
Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
http://www.lrec-conf.org/proceedings/lrec2012/pdf/814_Paper.pdf

The following more recent article compares all freely available Language Technology resources distributed by the JRC and provides comparative background information:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014).
An overview of the European Union's highly multilingual parallel corpora<http://link.springer.com/article/10.1007/s10579-014-9277-0>.
Language Resources and Evaluation Journal (LRE).
DOI: 10.1007/s10579-014-9277-0.
(Read the manuscript at https://ec.europa.eu/jrc/sites/jrcsh/files/2014_08_LRE-Journal_JRC-Linguistic-Resources_Manuscript.pdf).

-----
Ralf Steinberger<https://ec.europa.eu/jrc/en/person/ralf-steinberger>
European Commission - Joint Research Centre (JRC)
I 03 - Competence Centre on Text Mining and Analysis<https://ec.europa.eu/jrc/en/text-mining-and-analysis>
T.P. 267, Via E. Fermi 2749
21027 Ispra (VA), Italy
URL - Resources: https://ec.europa.eu/jrc/en/language-technologies
