[Corpora-List] bilingual labeled corpora

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Wed May 16 13:26:32 CEST 2012

Dear Germán,

We just released (today - the email is on its way!) a multi-label classification tool which has been trained for 22 languages and which comes with manually annotated topic descriptors, drawn from the EuroVoc thesaurus. The multi-label annotation is at document level. There are between twenty and forty thousand documents per language.

You can find it at http://langtech.jrc.ec.europa.eu/Eurovoc.html .

Maybe this corpus is useful for you.

If you are looking for individual aligned sentences, then the DGT-Translation Memory (DGT-TM) may be what you need. While the sentences in DGT-TM are not individually annotated, each is accompanied by a document identifier, so that, with a bit of effort, you can retrieve the EuroVoc descriptors for the corresponding documents. DGT-TM exists in the same 22 languages and can be downloaded from http://langtech.jrc.ec.europa.eu/DGT-TM.html .
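
The lookup via the document identifier could be sketched roughly as follows. This is only an illustration in Python, not part of either release: the identifiers, the descriptor strings and the function name label_sentence_pairs are all made up, and the real DGT-TM and EuroVoc files would first have to be parsed into structures like these.

```python
# Hypothetical sample data: real document identifiers, descriptors and
# file formats will differ and must be read from the actual resources.
sentence_pairs = [
    ("doc-001", "The regulation enters into force.", "El reglamento entra en vigor."),
    ("doc-001", "It applies from 1 January.", "Se aplica a partir del 1 de enero."),
    ("doc-002", "Fishing quotas are set annually.", "Las cuotas de pesca se fijan anualmente."),
]

# Document-level descriptors keyed by the same (assumed) document identifier.
doc_descriptors = {
    "doc-001": ["descriptor A", "descriptor B"],
    "doc-002": ["descriptor C"],
}


def label_sentence_pairs(pairs, descriptors):
    """Attach document-level labels to each sentence pair via its document id."""
    labeled = []
    for doc_id, source, target in pairs:
        labeled.append({
            "doc_id": doc_id,
            "source": source,
            "target": target,
            # Sentences from documents without descriptors get an empty list.
            "labels": list(descriptors.get(doc_id, [])),
        })
    return labeled


labeled_pairs = label_sentence_pairs(sentence_pairs, doc_descriptors)
```

Note that every sentence pair inherits all labels of its document, so the labels remain document-level information rather than a per-sentence annotation.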



Ralf Steinberger (Ralf.Steinberger at jrc.ec.europa.eu)
European Commission – Joint Research Centre (JRC)
IPSC – GlobeSec – OPTIMA
URL – Applications: http://emm.newsbrief.eu/overview.html
URL – The science behind them: http://langtech.jrc.ec.europa.eu
T.P. 267, Via E. Fermi 2749
21027 Ispra (VA), Italy
Tel: +39 0332 78-6271
Fax: +39 0332 78-5154
Secretary: +39 0332 78-5648 or 9478

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Germán Sanchis Trilles
Sent: 16 May 2012 12:56
To: CORPORA at uib.no
Subject: [Corpora-List] bilingual labeled corpora

Dear list,

To perform some SMT experiments, I need bilingual corpora with different kinds of annotation, such as topic or dialog act labels (or other kinds of labels). Does anyone know of such corpora?

Thanks in advance,

best regards,

Germán Sanchis-Trilles
