[Corpora-List] [Release] Large Multilingual Corpus of Sense-Annotated Textual Definitions

Jose Camacho Collados collados at di.uniroma1.it
Mon Apr 11 11:44:39 CEST 2016


We are pleased to announce the release of a large corpus of sense-annotated textual definitions for 263 languages. To the best of our knowledge, this is the largest available corpus of its kind, with more than 38 million definitions drawn from six resources (WordNet, Wikipedia, Wikidata, Open Multilingual WordNet, Wiktionary and OmegaWiki) and almost 250 million sense annotations. All definitions have been automatically disambiguated by making the most of their cross-language and cross-resource complementarities. The disambiguation relies on BabelNet (http://babelnet.org), the largest multilingual encyclopedic dictionary and semantic network; Babelfy (http://babelfy.org), a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system; and the semantic vector representations of NASARI (http://lcl.uniroma1.it/nasari).

We release two versions of the corpus, both stored in easy-to-process XML files divided by language and resource. The first version (“complete”) has been fully disambiguated for all content words and named entities, with an estimated precision above 75% for most languages. The second version (“high-precision”) has reduced coverage (around 65% of all content words and 75% of noun instances) but higher precision (estimated above 90%).
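For readers who want to get a feel for processing files of this size, here is a minimal Python sketch of streaming through one XML file and counting definitions and sense annotations. The element and attribute names (`definition`, `annotation`, `anchor`, `sense`) and the sample data are invented for illustration only; they are not taken from the released files, so please check the actual format on the download page and adjust the tag names accordingly.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag and attribute names below are invented for
# illustration and are NOT taken from the released files.
SAMPLE = b"""<definitions>
  <definition id="d1">
    <text>A domesticated feline.</text>
    <annotation anchor="feline" sense="s1"/>
  </definition>
  <definition id="d2">
    <text>The capital of Italy.</text>
    <annotation anchor="capital" sense="s2"/>
    <annotation anchor="Italy" sense="s3"/>
  </definition>
</definitions>"""


def count_annotations(source, definition_tag="definition", annotation_tag="annotation"):
    """Stream an XML file and count definitions and their sense annotations.

    Both tag names are assumptions; pass the real ones once you know them.
    """
    n_definitions = 0
    n_annotations = 0
    # iterparse yields each element when its end tag is read, so a large
    # file does not have to be loaded into memory in one go.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == definition_tag:
            n_definitions += 1
            # Count the annotation elements that are direct children.
            n_annotations += len(elem.findall(annotation_tag))
            elem.clear()  # release the subtree we have just processed
    return n_definitions, n_annotations


# Works on any binary file object or file path; here an in-memory sample.
counts = count_annotations(io.BytesIO(SAMPLE))
print(counts)
```

Streaming with `iterparse` rather than parsing each file whole is only a suggestion, given that the corpus contains tens of millions of definitions; for small per-language files a plain `ET.parse` would do just as well.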

All the resources are freely available for download at http://lcl.uniroma1.it/disambiguated-glosses

Reference paper:

José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli. A Large-Scale Multilingual Disambiguation of Glosses. In Proceedings of LREC 2016 (to appear), Portorož, Slovenia, 23-28 May 2016.

Kind regards,

José Camacho Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli.

Linguistic Computing Laboratory, Sapienza University of Rome

--
José Camacho Collados
Linguistic Computing Laboratory (LCL)
Sapienza University of Rome
http://wwwusers.di.uniroma1.it/~collados/


