SEW: a Wikipedia corpus with 200M sense annotations of 4M entities and concepts

Mon Jul 4 13:36:59 CEST 2016

We are pleased to announce the release of *SEW* (Semantically Enriched Wikipedia), a sense-annotated corpus, automatically built from Wikipedia, in which the overall number of linked mentions has been more than tripled solely by exploiting the hyperlink structure of Wikipedia pages and categories, along with the wide-coverage sense inventory of *BabelNet* (http://babelnet.org).

SEW can be used both as a large-scale Wikipedia-based semantic network and as a sense-tagged dataset with more than *200 million* annotations of over *4 million* different concepts and named entities.

We release two different versions of the corpus, both created from the Wikipedia dump of November 2014, and stored in easy-to-process XML files:

- A "complete" version, with all discovered annotations (including duplicates and overlapping mentions);

- A "conservative" version, with only one sense annotation per tagged mention and no overlap.
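
To illustrate the difference between the two versions, here is a minimal Python sketch of what "one sense annotation per tagged mention and no overlap" can look like in practice. It is not the procedure used to build SEW: the `Annotation` type, the character-offset representation, the `sense:` identifiers and the longest-mention-first heuristic are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    start: int  # offset of the first character of the mention
    end: int    # offset one past the last character (exclusive)
    sense: str  # placeholder sense identifier, not a real BabelNet ID


def drop_overlaps(annotations: List[Annotation]) -> List[Annotation]:
    """Keep a non-overlapping subset, preferring longer mentions (assumed heuristic)."""
    # Longest mentions first; ties are broken by position, then by sense,
    # so the result is deterministic.
    ranked = sorted(annotations, key=lambda a: (-(a.end - a.start), a.start, a.sense))
    kept: List[Annotation] = []
    for cand in ranked:
        # A candidate survives only if it does not intersect any kept span.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda a: a.start)


# Hypothetical "complete" annotations over the text "New York City ... USA".
complete = [
    Annotation(0, 8, "sense:new_york"),
    Annotation(0, 13, "sense:new_york_city"),
    Annotation(9, 13, "sense:city"),
    Annotation(0, 13, "sense:new_york_city"),  # exact duplicate
    Annotation(20, 23, "sense:usa"),
]
conservative = drop_overlaps(complete)
for ann in conservative:
    print(ann)
```

With this input, the duplicate and the two shorter mentions nested inside "New York City" are dropped, leaving one annotation per mention. Other tie-breaking policies (for example, preferring the original Wikipedia hyperlink) would be equally plausible.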

We also release two *vector representations* constructed using SEW and used in the extrinsic evaluation of the corpus:

- *WB-SEW*, a vector representation for BabelNet synsets in which dimensions are Wikipedia pages;

- *SB-SEW*, a vector representation for Wikipedia pages in which dimensions are BabelNet synsets.
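
As a rough illustration of how such representations might be used, the Python sketch below compares two sparse vectors with cosine similarity. The dictionary layout, the `page:` dimension names and the weights are invented for the example and do not reflect the released file format; the choice of cosine similarity is likewise an assumption here, not a statement about the evaluation in the paper.

```python
import math
from typing import Dict

# A sparse vector maps a dimension name to its weight; absent dimensions are 0.
SparseVector = Dict[str, float]


def cosine(u: SparseVector, v: SparseVector) -> float:
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    # Iterate over the smaller vector so the dot product touches fewer keys.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(weight * v.get(dim, 0.0) for dim, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical WB-SEW-style vectors: one per synset, dimensions are pages.
synset_a = {"page:Rome": 1.0, "page:Italy": 1.0}
synset_b = {"page:Italy": 1.0, "page:Pasta": 1.0}
synset_c = {"page:Jazz": 2.0}

print(cosine(synset_a, synset_b))  # one shared dimension out of two each
print(cosine(synset_a, synset_c))  # no shared dimensions
```

The same function applies unchanged to SB-SEW-style vectors, where each Wikipedia page is described by weights over synsets instead.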

All of the above resources are freely available for download at http://lcl.uniroma1.it/sew

*Reference paper (to appear):*

Alessandro Raganato, Claudio Delli Bovi and Roberto Navigli.

*Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia.*

Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), New York City, New York, USA, 9-15 July 2016.

Kind regards,

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli.

Linguistic Computing Laboratory, Sapienza University of Rome

