Quick numbers: 237,702 lyrics in bag-of-words format, top 5,000 words provided.
This is, to our knowledge, the largest lyrics dataset ever released for research. It is useful on its own, but every bag-of-words is also resolved directly to a Million Song Dataset (MSD) track, which links it to metadata such as artist name, song title, release year, similar artists, tags, and audio features.
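For readers new to the format, here is a minimal sketch of what "bag-of-words restricted to a fixed vocabulary" means. The function name, the simple regex tokenizer, and the three-word vocabulary are assumptions made for illustration only; the tokenization and normalization actually used to build the dataset are not described here and may well differ.

```python
import re
from collections import Counter


def bag_of_words(lyrics, vocabulary):
    """Count how often each vocabulary word occurs in a lyric string.

    Word order is discarded and words outside the vocabulary are dropped,
    which is why the original lyric cannot be rebuilt from the bag.
    """
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    vocab = set(vocabulary)
    return dict(Counter(token for token in tokens if token in vocab))


# Hypothetical three-word vocabulary standing in for the top 5,000 words.
example_vocab = ["love", "you", "baby"]
example_bag = bag_of_words("Love me, love you, baby", example_vocab)
print(example_bag)
```

In this toy case "me" is not in the vocabulary, so it is silently dropped, and only the per-word counts survive.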
We are extremely grateful to www.musixmatch.com for generously donating this data and for helping to prepare it.
The data is clean, meaning that we have removed all known duplicates and instrumental songs. We also provide you with the musiXmatch track ID so you can verify the information yourself. The data comes split into train and test sets to encourage the reporting of comparable results, even on learning-based tasks.
Although we have worked hard on this release, we cannot claim it is perfect. We welcome questions, feedback, and error reports. Finally, try singing bags-of-words: now that's a challenge!
Thierry Bertin-Mahieux, for the Million Song Dataset team, in collaboration with musiXmatch.com
http://labrosa.ee.columbia.edu/millionsong/musixmatch