[Corpora-List] WikiReading: Three large machine reading datasets, in English, Russian and Turkish, published by Google Research

Tom Kenter tom.kenter at gmail.com
Wed Mar 14 22:48:12 CET 2018

In addition to an English machine reading dataset already published in 2016, Google Research recently added two new large sets, one in Russian and one in Turkish. All sets can be downloaded from http://goo.gl/wikireading.

Together, the datasets provide a unique collection for comparing machine reading algorithms on one task, across morphologically different languages.

The sets are based on Wikipedia, and consist of Wikipedia articles accompanied by key-value pairs representing knowledge about the entity the Wikipedia page is about. The task is to predict values for the keys, given the Wikipedia article. This is challenging as neither the key nor the value necessarily occurs in the document text verbatim.
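
To make the task format concrete, here is a minimal Python sketch of a single instance. The class name, field names and example text are made up for illustration and are not the schema of the released files. The helper checks which values appear word-for-word in the article, which is the condition the task does not guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadingExample:
    """One hypothetical (article, key, values) instance; field names are illustrative."""
    document: str       # Wikipedia article text
    key: str            # the key to predict a value for
    values: List[str]   # one or more correct values for that key


def verbatim_values(example: ReadingExample) -> List[str]:
    """Return the values that occur word-for-word (case-insensitive) in the article."""
    text = example.document.lower()
    return [v for v in example.values if v.lower() in text]


# Made-up instance: the value never appears in the text, so it would have to be inferred.
example = ReadingExample(
    document="Ada Lovelace was an English mathematician and writer. "
             "She worked on the Analytical Engine.",
    key="sex or gender",
    values=["female"],
)
print(verbatim_values(example))
```

An empty list here means that no value can simply be copied from the article, so a model would have to infer it (in this made-up case, from the pronoun "She").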

The sets are large: the English set has 16M training examples, plus 1.89M validation and 941K test examples. The Russian set has 4.26M training, 531K validation and 533K test examples, and the Turkish set has 655K training, 81.6K validation and 82.6K test examples.
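
As a quick check of how these splits compare across languages, the short Python sketch below computes the share of each split from the rounded figures quoted above. The dictionary layout and function name are only illustrative.

```python
# Split sizes as quoted in the announcement (rounded figures).
SPLITS = {
    "English": {"train": 16_000_000, "validation": 1_890_000, "test": 941_000},
    "Russian": {"train": 4_260_000, "validation": 531_000, "test": 533_000},
    "Turkish": {"train": 655_000, "validation": 81_600, "test": 82_600},
}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


for language, sizes in SPLITS.items():
    fractions = split_fractions(sizes)
    summary = ", ".join(f"{name} {frac:.1%}" for name, frac in fractions.items())
    print(f"{language}: {summary}")
```

With these rounded figures, the training split is roughly 85% of the English set and roughly 80% of the Russian and Turkish sets.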

The datasets come with results for many deep learning baseline algorithms, as described in "WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia", Hewlett et al., ACL 2016 <https://arxiv.org/abs/1608.03542> (the English WikiReading dataset) and "Byte-level Machine Reading across Morphologically Varied Languages", Kenter et al., AAAI-18 <http://tomkenter.nl/pdf/kenter_byte-level_2018.pdf> (the Turkish and Russian datasets).

For more information, see: http://goo.gl/wikireading
