[Corpora-List] Yahoo News Feed dataset

Tristan Miller miller at ukp.informatik.tu-darmstadt.de
Fri Apr 29 14:46:23 CEST 2016

Greetings, all.

Back in January of this year Yahoo announced with great fanfare [1] the release of a 13.5 TB dataset. This "Yahoo News Feed dataset" was supposed to contain "a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate" and was billed as "the largest-ever machine learning dataset for researchers". The press release got picked up and covered by a number of science and tech news outlets.

A colleague of mine applied to Yahoo to obtain this dataset back in January. He received a response only today -- Yahoo refused to release the data to him. On its home page [2] the dataset is now marked as "no longer available".

Was *anyone* here successful in obtaining the data from Yahoo? Not that I want to get a copy second-hand; I'm just curious as to whether the dataset was ever actually distributed to any researcher outside Yahoo. I've heard grumblings that this may have been nothing more than an empty PR stunt by Yahoo, as well as speculation that the dataset was withdrawn due to concerns about its anonymity [3].

Regards, Tristan

[1] https://finance.yahoo.com/news/yahoo-releases-largest-ever-machine-140000758.html

[2] https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=75

[3] http://motherboard.vice.com/read/yahoos-gigantic-anonymized-user-dataset-isnt-all-that-anonymous

--
Tristan Miller, Research Scientist
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 162 5296 | Web: https://www.ukp.tu-darmstadt.de/
