DGT-TM Translation Memory
Freely available
22 languages
231 language pairs
Format: TMX version 1
http://langtech.jrc.it/DGT-TM.html
The European Commission's Directorate General for Translation (DGT) and the Joint Research Centre (JRC) have made available a multilingual Translation Memory (sentences and their translations, in standard TMX format) for the 22 official European Union languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish.
This release follows the public release, in May 2006, of the JRC-Acquis multilingual parallel corpus (http://langtech.jrc.it/JRC-Acquis.html), which offers sentence alignment for 231 language pairs and has a total size of over 1 billion words.
The data releases of DGT and JRC are in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.
The Translation Memory contains most, but not all, of the Acquis Communautaire, which is the entire body of European legislation, including all the treaties, regulations and directives adopted by the European Union (EU) and the rulings of the European Court of Justice. Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation is translated into 22 official EU languages. For the 23rd official EU language, Irish, the Acquis is not translated on a regular basis.
A translation memory is a collection of small text segments and their translations. These segments can be sentences or sentence parts. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.
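As a rough illustration of how such segment pairs can be read from a TMX file, here is a minimal Python sketch. It assumes the usual TMX layout of `tu` (translation unit), `tuv` (language variant) and `seg` (segment) elements. The sample document, the name `extract_pairs` and the language codes are hypothetical and only serve the example; the actual DGT-TM files may use different language codes or header fields.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical miniature TMX document, not taken from DGT-TM.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall be binding in its entirety.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Diese Verordnung ist in allen ihren Teilen verbindlich.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) segments for units that contain both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # Key on the primary language subtag, so "EN-GB" matches "en".
                key = lang.split("-")[0].lower()
                segs[key] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


if __name__ == "__main__":
    for source, target in extract_pairs(SAMPLE_TMX, "en", "de"):
        print(source, "=>", target)
```

Units that lack one of the two requested languages are skipped, which is how a single multilingual memory can yield a separate bilingual extraction for each language pair.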
Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:
training automatic systems for Statistical Machine Translation (SMT);
producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
training and testing multilingual information extraction software;
checking translation consistency automatically;
testing and benchmarking alignment software (for sentences, words, etc.).
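One of the uses listed above, automatic consistency checking, can be sketched in a few lines of Python. The idea shown here is only one simple approach: group the segment pairs by source segment and report every source that has more than one distinct translation. The function name `find_inconsistencies` and the sample pairs are hypothetical and not part of DGT-TM.

```python
from collections import defaultdict


def find_inconsistencies(pairs):
    """Map each source segment to its translations if it has more than one."""
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source.strip()].add(target.strip())
    # Keep only sources that were translated in more than one way.
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


if __name__ == "__main__":
    # Hypothetical (English, German) pairs for illustration only.
    sample_pairs = [
        ("Article 1", "Artikel 1"),
        ("Done at Brussels", "Geschehen zu Bruessel"),
        ("Done at Brussels", "Bruessel, den"),
        ("Article 1", "Artikel 1"),
    ]
    for source, targets in find_inconsistencies(sample_pairs).items():
        print(source, "->", targets)
```

Exact string matching is a deliberate simplification. A practical check would probably normalise whitespace, case and punctuation first, and would accept that some variation between translations is legitimate.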
For usage conditions, details regarding the difference between DGT-TM and the JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html), size information, downloading instructions, etc., go to http://langtech.jrc.it/DGT-TM.html.
Achim Blatt
Directorate General for Translation (DGT)
Unit DGT.R.3 Informatics (http://ec.europa.eu/dgs/translation/)
Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it/)
The JRC's Language Technology group specialises in the development of highly multilingual text analysis tools and in cross-lingual applications. Many applications are accessible online, e.g.:
- NewsExplorer (http://press.jrc.it/NewsExplorer/): multilingual news aggregation and analysis (19 languages); lets users navigate the news over time and across languages; trend analysis; collects information about people from the news; social network detection.
- NewsBrief (http://press.jrc.it/): breaking news detection and display of the very latest thematic news from around the world; email alerting (22+ languages).
- MedISys Medical Information System (http://medusa.jrc.it/): latest health-related news from around the world according to themes and diseases (22+ languages).