[Corpora-List] bilingual labeled corpora

Germán Sanchis Trilles gsanchis at dsic.upv.es
Thu May 17 12:27:19 CEST 2012


Dear Ralf,

thank you very much for the information. I am looking into the corpora you pointed out, and I think they will actually be very useful for my research :)

Best regards,

Germán Sanchis-Trilles

On Wed, 16 May 2012, Ralf Steinberger wrote:


> Dear Germán,
>
> We just released (today - the email is on its way!) a multi-label classification tool which has been trained for 22 languages and which comes with manually annotated topic descriptors, drawn from the EuroVoc thesaurus. The multi-label annotation is at document level. There are between twenty and forty thousand documents per language.
>
> You can find it at http://langtech.jrc.ec.europa.eu/Eurovoc.html .
>
> Maybe this corpus is useful for you.
>
> Should you be looking for individual aligned sentences, then maybe the DGT-Translation Memory (DGT-TM) is what you need. While the sentences in DGT-TM are not individually annotated, they are accompanied by a document identifier so that, with a bit of effort, you can retrieve the EuroVoc descriptors for these documents. DGT-TM exists in the same 22 languages and is downloadable from http://langtech.jrc.ec.europa.eu/DGT-TM.html .
>
> Greetings,
>
> Ralf
>
>
> Ralf Steinberger (Ralf.Steinberger at jrc.ec.europa.eu)
> European Commission – Joint Research Centre (JRC)
> IPSC – GlobeSec – OPTIMA
> URL – Applications: http://emm.newsbrief.eu/overview.html
> URL – The science behind them: http://langtech.jrc.ec.europa.eu
> T.P. 267, Via E. Fermi 2749
> 21027 Ispra (VA), Italy
> Tel: +39 0332 78-6271
> Fax: +39 0332 78-5154
> Secretary: +39 0332 78-5648 or 9478
>
>
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Germán Sanchis Trilles
> Sent: 16 May 2012 12:56
> To: CORPORA at uib.no
> Subject: [Corpora-List] bilingual labeled corpora
>
> Dear list,
>
> for some SMT experiments I need bilingual corpora with different kinds of annotations, such as topic or dialog act labels (or other kinds of labels). Does anyone know of such corpora?
>
> Thanks in advance,
>
> best regards,
>
> Germán Sanchis-Trilles
>
>


