[Corpora-List] Corpus for hierarchical classification

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Thu Feb 19 18:45:43 CET 2015

Dear Ivelina,

The document collection distributed with the JRC Eurovoc Indexer (JEX) <https://ec.europa.eu/jrc/en/language-technologies/jrc-eurovoc-indexer> has been manually categorised according to the Eurovoc thesaurus, which uses eight hierarchic levels.

The JEX document collection consists of mostly parallel documents in 22 languages. For most languages, there are about 41,000 documents.

To get the text collections, you need to download the 'Indexing and training' version.

You can find JEX and the documents at:


More details on the collection are given in the paper:

Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc Indexer JEX - A freely available multi-label categorisation tool. Proceedings of the 8th international conference on Language Resources and Evaluation (LREC'2012 <http://www.lrec-conf.org/lrec2012/> ), pp. 798-805, Istanbul, 21-27 May 2012. (Read online <http://www.lrec-conf.org/proceedings/lrec2012/pdf/875_Paper.pdf> )

Let us know what results you achieve on this collection, if you decide to use it. Thanks.

All the best,


Ralf Steinberger

European Commission - Joint Research Centre (JRC)

URL - Applications: http://emm.newsbrief.eu/overview.html

Further multilingual linguistic resources: https://ec.europa.eu/jrc/en/language-technologies

21027 Ispra (VA), Italy

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Ivelina Nikolova
Sent: 19 February 2015 14:08
To: corpora at uib.no
Subject: [Corpora-List] Corpus for hierarchical classification

Dear All,

I'm searching for a corpus suitable for training hierarchical classification models.

I'll be thankful if you have any suggestions.




Dr Ivelina Nikolova

Assistant at Linguistic Modelling Department
Institute of Information and Communication Technologies
Bulgarian Academy of Sciences


