[Corpora-List] JEX - A freely available multi-label categorisation tool trained for 22 languages

Ralf Steinberger ralf.steinberger at jrc.ec.europa.eu
Wed May 16 12:57:23 CEST 2012


The JRC EuroVoc Indexer JEX <http://langtech.jrc.ec.europa.eu/Eurovoc.html> is pre-trained multi-label categorisation software that assigns categories from the large-scale, wide-coverage EuroVoc Thesaurus <http://eurovoc.europa.eu/> (consisting of thousands of categories). JEX is distributed together with its training data (twenty to forty thousand documents per language). It has been trained for 22 languages on mostly parallel text (texts and their professionally produced translations). You can re-train JEX on your own documents, and even with your own categorisation scheme. JEX provides a graphical user interface (GUI), a command-line option for batch processing, and an API.


Languages: Pre-trained for 22 languages, but trainable for many more:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,

Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese,

Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish.

Language families: Germanic, Romance, Slavic, Hellenic, Finno-Ugric, Baltic and Semitic.

URL: http://langtech.jrc.ec.europa.eu/Eurovoc.html

Creator: European Commission – Joint Research Centre (JRC <http://langtech.jrc.ec.europa.eu/>)


JEX can be used fully automatically or as an interactive tool to support professional librarians in their work.

JEX also has many potential uses in the field of Computational Linguistics because it is highly multilingual and lends itself to cross-lingual tasks:

• Use for multilingual classification experiments, e.g. to test the impact of different document representations (n-grams, lemmas, POS, word-sense disambiguation, …) across different languages and language families;

• Use as input to other text mining applications, e.g.

    • Detect document translations (Pouliquen et al. 2004);

    • Detect cross-lingual plagiarism (Potthast et al. 2010);

    • Link related documents across languages (Pouliquen et al. 2008);

    • Support lexical choice in Machine Translation;

    • Rank sentences in topic-specific summarisation;

• …
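
To make the cross-lingual use cases above more concrete, here is a minimal Python sketch of how documents in different languages might be linked through their assigned categories. It assumes each document has already been given a set of weighted thesaurus descriptors; because the descriptors are language-independent, the vectors can be compared directly. The descriptor names, weights and document identifiers are invented for illustration, and the code does not use JEX's API or output format, nor does it claim to reproduce the method of the cited papers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical category assignments: {descriptor: weight}.
# The descriptor names are placeholders, not real EuroVoc identifiers.
docs = {
    "en_doc": {"descriptor_A": 0.9, "descriptor_B": 0.5, "descriptor_C": 0.2},
    "fr_doc": {"descriptor_A": 0.8, "descriptor_B": 0.6, "descriptor_D": 0.1},
    "de_doc": {"descriptor_E": 0.7, "descriptor_F": 0.7},
}


def most_similar(query_id, docs):
    """Return (score, doc_id) of the document closest to query_id."""
    query = docs[query_id]
    scored = [
        (cosine(query, vec), other)
        for other, vec in docs.items()
        if other != query_id
    ]
    return max(scored)


if __name__ == "__main__":
    score, match = most_similar("en_doc", docs)
    print(f"Closest document to en_doc: {match} (cosine {score:.3f})")
```

In this toy data the English and French documents share two descriptors and therefore score highest, while the German document shares none and scores zero. A real system would need a similarity threshold chosen on held-out data.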


At http://langtech.jrc.ec.europa.eu/ you will find more information on the JRC’s multilingual language technology activities, download links for the JRC EuroVoc Indexer JEX, and a page pointing to further freely available multilingual resources. For details on JEX and its performance, please see the following publication, which you may also wish to cite in scientific work:

Steinberger Ralf, Mohamed Ebrahim & Marco Turchi (2012). JRC EuroVoc Indexer JEX - A freely available multi-label categorisation tool. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012. Available at: http://langtech.jrc.ec.europa.eu/Documents/2012_LREC-JEX-final.pdf

Ralf Steinberger, Mohamed Ebrahim & Marco Turchi
European Commission - Joint Research Centre (JRC)
21027 Ispra (VA), Italy

URL – Applications: http://emm.newsbrief.eu/overview.html

URL – The science behind them: http://langtech.jrc.ec.europa.eu/
