[Corpora-List] New release of DGT-TM (parallel corpus in 23 languages)

Thomas Schoenemann thomas_schoenemann at yahoo.de
Thu Nov 8 14:40:10 CET 2012

Hi everyone,

out of curiosity, I would like to know how DGT-TM relates to the Europarl corpus (available through UEdinburgh). I had a look at the 2004 data, and it appears that the fragments here are often entire sentences, or even two or three sentences (probably because in one of the other languages it is just one sentence). So that would be like Europarl. But for Europarl, I know that a sentence aligner was used. So what's the difference?

Can anyone help?

Thanks!   Thomas


From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
To: corpora at uib.no; ln at cines.fr; clef at dei.unipd.it; elsnet-list at elsnet.org; mt-list at eamt.org
Sent: 17:04 Monday, 5 November 2012
Subject: [Corpora-List] New release of DGT-TM (parallel corpus in 23 languages)

DGT-TM is an extraction of the translation memory of the European Institutions for all official EU languages, produced by the European Commission’s Directorate General for Translation (DGT) and distributed by the Joint Research Centre (JRC). Translation memories are sentences and their manually produced translations.

The new release is called DGT-TM-2012. It follows the previous releases, DGT-TM (2007) and DGT-TM-2011. DGT-TM-2012 adds over six million translation units to the previous 57 million translation units, resulting in almost 3.3 million sentences for most languages and 63 million translation units in total.

New features of DGT-TM-2012 are:

· Small amounts of Irish data are now included for the first time;
· Significantly more data for the Bulgarian, Maltese and Romanian languages;
· About 285K new translation units for most languages.

Languages:  All 253 language pairs involving the following 23 languages:
            Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
            Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian,
            Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish
            and Swedish.
URL:        http://langtech.jrc.ec.europa.eu/DGT-TM.html
Creator:    European Commission - Directorate General for Translation (DGT)

WHAT IS DGT-TM

The ‘Acquis Communautaire’ is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 22 official languages. For the 23rd official EU language, Irish, the Acquis has not been translated on a regular basis, which is why DGT-TM includes only a small amount of Irish data.
The Acquis Communautaire was split into sentences and aligned automatically at sentence level, resulting in the DGT translation memory, DGT-TM. The text data is accompanied by software that allows users to extract all sentences and their translations for any of the 253 possible language pair combinations.

MOTIVATION FOR THIS RELEASE

The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of the JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22 languages), the DGT-TM Translation Memory in 2007, the multilingual named entity resource JRC-Names in 2011, the multilingual multi-label classification tool (and accompanying text data) JRC EuroVoc Indexer (JEX) (22 languages), and further smaller multilingual resources. See http://langtech.jrc.ec.europa.eu/JRC_Resources.html for more information on these resources.

WHAT DGT-TM CAN BE USED FOR

DGT-TM can be fed into translation memory software to support human translators in their work. As it is a large parallel corpus in electronic form, it can also be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.

MORE INFORMATION ON DGT-TM

At http://langtech.jrc.ec.europa.eu/JRC_Publications.html you will find detailed publications on the JRC’s multilingual language technology activity. For details on DGT-TM, you can read:

Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter (2012). DGT-TM: A freely Available Translation Memory in 22 Languages. Proceedings of the 8th international conference on Language Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.
http://langtech.jrc.ec.europa.eu/Documents/2012_LREC_DGT-TM_Final.pdf

WHAT NEXT?

The JRC and collaborating services of the European Commission are currently finalising the release of further large-scale linguistic resources.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy
URL - Applications: http://emm.newsbrief.eu/overview.html
URL - Resources: http://ipsc.jrc.ec.europa.eu/index.php?id=61
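
As a rough illustration of the "253 language pairs" figure and of what extracting one language pair from a translation memory involves, here is a minimal Python sketch. It is not the extraction software distributed with DGT-TM. It assumes the data is in TMX-style XML with `tu`, `tuv` and `seg` elements and with language codes such as `EN-GB` in an `xml:lang` or `lang` attribute; the sample document, the language codes and the function names are made up for the example.

```python
import itertools
import xml.etree.ElementTree as ET

# The 23 languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian", "German",
    "Greek", "Finnish", "French", "Irish", "Hungarian", "Italian", "Latvian",
    "Lithuanian", "Maltese", "Polish", "Portuguese", "Romanian", "Slovak",
    "Slovene", "Spanish", "Swedish",
]

# ElementTree expands the built-in "xml:" prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy TMX-like document (invented for illustration, not taken from DGT-TM).
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Annex</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Annexe</seg></tuv>
    </tu>
  </body>
</tmx>"""


def count_language_pairs(languages):
    """Number of unordered pairs of distinct languages: n * (n - 1) / 2."""
    return sum(1 for _ in itertools.combinations(languages, 2))


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for translation units
    that contain both requested language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Assumption: the code is in xml:lang or a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.upper()] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


if __name__ == "__main__":
    print(count_language_pairs(LANGUAGES))
    print(extract_pairs(SAMPLE_TMX, "EN-GB", "DE-DE"))
```

A translation unit that lacks one of the two requested languages is simply skipped, which is why different language pairs can yield different numbers of sentence pairs.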
