[Corpora-List] Multi-clustering text datasets

Sajib Dasgupta sdgnew at gmail.com
Mon Aug 31 11:37:36 CEST 2015


Dear All,

We are proud to distribute the Multi-clustering datasets, a subset of which was introduced in the following paper:

Mining Clustering Dimensions. Sajib Dasgupta and Vincent Ng. In the Proceedings of the International Conference on Machine Learning (ICML), 2010.

While traditional work on text clustering has largely focused on grouping documents by topic, a user may want to cluster documents along other dimensions, such as the author's mood, gender, age or sentiment. A user often has a single clustering along a particular dimension in mind, but knowing that there are alternative ways to cluster the same data can offer valuable insights that would otherwise be missed.

Motivated in part by this observation, we take a multifaceted approach to document annotation: we annotate a set of documents across multiple dimensions, where each dimension represents a particular classification structure along which the document set can be meaningfully categorized.

We use the annotations as a gold standard to evaluate an alternative (or multi-) clustering system, which seeks to organize, or cluster, a set of text documents along multiple dimensions.
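
As a rough illustration of this kind of evaluation, the sketch below scores each clustering produced by a system against the gold labels of every annotated dimension. It uses cluster purity purely as a simple stand-in metric; it is not necessarily the measure used in the paper. The dimension names, labels and cluster assignments are hypothetical toy data, not taken from the datasets.

```python
from collections import Counter


def purity(predicted, gold):
    """Fraction of documents that carry the majority gold label of their cluster."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty document set")
    # Count gold labels inside each predicted cluster.
    by_cluster = {}
    for cluster_id, label in zip(predicted, gold):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    # Sum the size of the majority label in every cluster.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(gold)


def score_all(clusterings, gold_by_dimension):
    """Score every clustering against every annotated dimension."""
    return {
        name: {dim: purity(pred, gold) for dim, gold in gold_by_dimension.items()}
        for name, pred in clusterings.items()
    }


# Hypothetical toy data: six documents annotated along two dimensions.
gold_by_dimension = {
    "topic": ["books", "books", "books", "dvd", "dvd", "dvd"],
    "sentiment": ["pos", "neg", "pos", "neg", "pos", "neg"],
}

# Hypothetical system output: one cluster assignment per discovered clustering.
clusterings = {
    "clustering_1": [0, 0, 0, 1, 1, 1],
    "clustering_2": [0, 1, 0, 1, 0, 1],
}

scores = score_all(clusterings, gold_by_dimension)
for name, per_dimension in scores.items():
    print(name, per_dimension)
```

In this toy case each clustering lines up with a different dimension, which is the situation a multi-clustering system aims for: every annotated dimension is recovered by at least one of the clusterings it produces.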

We host 12 document collections in the repository, including blogs, reviews, opinionated articles and political discussions, each of which is annotated along at least two dimensions.

The datasets can be downloaded from here:

http://www.hlt.utdallas.edu/~sajib/multi-clusterings.html

Sajib


