[Corpora-List] Multi-clustering text datasets

Eric Atwell E.S.Atwell at leeds.ac.uk
Mon Aug 31 13:37:22 CEST 2015


Corpus linguists are familiar with the idea of classifying or clustering texts in a corpus along more than one dimension; so a number of existing corpora could be used as "multi-clustering text datasets" for your machine learning experiments. For example, the British National Corpus Reference Guide states:

"... Texts are classified in several different ways in the BNC, as described in section 5.3.5 Text classification <http://www.natcorp.ox.ac.uk/docs/URG/cdifhd.html#hdpdtc> . Each text carries a number of text classification codes, specified as a string of values on the target attribute of its <catRefs> element ..."
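As a rough illustration of how such codes could feed a multi-clustering experiment, here is a minimal Python sketch that reads a whitespace-separated string of classification codes from an attribute. The element and attribute names are passed in as parameters and simply follow the wording of the quotation above; the sample snippet and its code values are made up for illustration, so check the names against the actual corpus files before relying on this.

```python
import xml.etree.ElementTree as ET


def classification_codes(xml_text, element, attribute):
    """Collect the whitespace-separated codes found on every matching element.

    `element` and `attribute` are supplied by the caller because the exact
    names used in the corpus files are an assumption here.
    """
    root = ET.fromstring(xml_text)
    codes = []
    for node in root.iter(element):
        # A missing attribute yields an empty string, which splits to [].
        codes.extend(node.get(attribute, "").split())
    return codes


# Hypothetical snippet: names follow the quotation, code values are illustrative.
sample = '<text><catRefs target="WRI ALLTIM1 WRIDOM5"/></text>'
codes = classification_codes(sample, element="catRefs", attribute="target")
print(codes)
```

Each code could then serve as a label on one classification dimension of the text.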


Eric Atwell, I-AIBS Institute for Artificial Intelligence, University of Leeds UK

________________________________
From: corpora-bounces at uib.no <corpora-bounces at uib.no> on behalf of Sajib Dasgupta <sdgnew at gmail.com>
Sent: 31 August 2015 10:37
To: corpora at uib.no
Subject: [Corpora-List] Multi-clustering text datasets

Dear All,

We are proud to distribute the Multi-clustering datasets, a subset of which was introduced in the following paper:

Mining Clustering Dimensions. Sajib Dasgupta and Vincent Ng. In the Proceedings of the International Conference on Machine Learning (ICML), 2010.

While traditional work on text clustering has largely focused on grouping documents by topic, a user may well want to cluster documents along other dimensions, such as the author's mood, gender, age or sentiment. Users often have a single clustering along a particular dimension in mind, but knowing that there are alternative ways to cluster the data may give them valuable insights that they would otherwise miss.

Motivated in part by this observation, we take a multifaceted approach to document annotation: we annotate a set of documents across multiple dimensions, where each dimension represents a particular classification structure along which the document set can be meaningfully categorized.

We use the annotations as a gold standard to evaluate an alternative (or multi-) clustering system, which seeks to organize, or cluster, a set of text documents along multiple dimensions.
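To make the evaluation idea concrete, here is a minimal Python sketch that scores each clustering produced by a system against the gold annotation of each dimension. It uses cluster purity purely as a simple stand-in metric; this is an assumption for illustration and not necessarily the measure used in the paper. The document labels, dimension names and clustering names below are toy values.

```python
from collections import Counter, defaultdict


def purity(predicted, gold):
    """Fraction of documents that carry the majority gold label of their cluster."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(predicted, gold):
        by_cluster[cluster_id][label] += 1
    matched = sum(max(counts.values()) for counts in by_cluster.values())
    return matched / len(gold)


def score_dimensions(clusterings, gold_by_dimension):
    """Score every clustering against every annotated dimension."""
    return {
        dim: {name: purity(pred, gold) for name, pred in clusterings.items()}
        for dim, gold in gold_by_dimension.items()
    }


# Toy data: six documents annotated along two hypothetical dimensions.
gold_by_dimension = {
    "topic": ["books", "books", "books", "dvd", "dvd", "dvd"],
    "sentiment": ["pos", "neg", "pos", "neg", "pos", "neg"],
}
# Two clusterings of the same six documents, as a multi-clustering system might produce.
clusterings = {
    "clustering_A": [0, 0, 0, 1, 1, 1],
    "clustering_B": [0, 1, 0, 1, 0, 1],
}
scores = score_dimensions(clusterings, gold_by_dimension)
for dim, row in scores.items():
    print(dim, row)
```

In this toy case each clustering matches one dimension perfectly and the other only partially, which is the pattern a multi-clustering system aims for: one produced clustering per annotated dimension.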

We host 12 document collections in the repository, including blogs, reviews, opinionated articles and political discussions, each of which is annotated along at least two dimensions.

The datasets can be downloaded from here:


