[Corpora-List] Multi-clustering text datasets

Sajib Dasgupta sdgnew at gmail.com
Tue Sep 1 05:06:48 CEST 2015

Thanks, Eric, for pointing us to the BNC corpus. It would be an excellent addition to our list of multi-clustering datasets.

We're looking for a few more datasets that can be naturally clustered/classified along multiple dimensions. If you know of other datasets of this kind, please let us know.
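
For anyone who wants to experiment with the BNC codes Eric describes below, here is a minimal Python sketch of how per-text classification codes might be turned into one clustering per dimension. It assumes the codes sit in a space-separated target attribute on a <catRefs> element, as the quoted Reference Guide passage says. The sample headers, the code values (MEDIUM1, DOMAIN3, ...) and the convention that letters name the dimension and digits name the category are all invented for illustration and are not the real BNC code scheme.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical header fragments: the codes below are made up for this sketch
# and do not correspond to actual BNC classification codes.
SAMPLE_HEADERS = {
    "doc1": '<textClass><catRefs target="MEDIUM1 DOMAIN3 TIME2"/></textClass>',
    "doc2": '<textClass><catRefs target="MEDIUM2 DOMAIN3 TIME1"/></textClass>',
    "doc3": '<textClass><catRefs target="MEDIUM1 DOMAIN5 TIME2"/></textClass>',
}

# Assumed convention: leading letters = dimension, trailing digits = category.
CODE_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)$")


def parse_codes(header_xml):
    """Return {dimension: category} for one text's classification codes."""
    root = ET.fromstring(header_xml)
    labels = {}
    for element in root.iter("catRefs"):
        for code in element.get("target", "").split():
            match = CODE_PATTERN.match(code)
            if match:
                labels[match.group(1)] = int(match.group(2))
    return labels


def group_by_dimension(headers):
    """Build one clustering per dimension: {dimension: {category: [doc ids]}}."""
    clusterings = defaultdict(lambda: defaultdict(list))
    for doc_id, header_xml in headers.items():
        for dimension, category in parse_codes(header_xml).items():
            clusterings[dimension][category].append(doc_id)
    return {dim: dict(groups) for dim, groups in clusterings.items()}


clusterings = group_by_dimension(SAMPLE_HEADERS)
for dimension, groups in clusterings.items():
    print(dimension, groups)
```

Each dimension then yields its own partition of the same documents, which is the kind of gold standard a multi-clustering system could be compared against.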



On Mon, Aug 31, 2015 at 4:37 AM, Eric Atwell <E.S.Atwell at leeds.ac.uk> wrote:

> Sajib,
> Corpus linguists are familiar with the idea of classifying or
> clustering texts in a corpus along more than one dimension; so a number of
> existing corpora could be used as "multi-clustering text datasets" for
> your machine learning experiments. For example, the British National Corpus
> Reference Guide states:
> "... Texts are classified in several different ways in the BNC, as
> described in section 5.3.5 Text classification
> <http://www.natcorp.ox.ac.uk/docs/URG/cdifhd.html#hdpdtc>. Each text
> carries a number of text classification codes, specified as a string of
> values on the *target* attribute of its <catRefs> element ..."
> http://www.natcorp.ox.ac.uk/docs/URG/codes.html#classcodes
> Eric Atwell, I-AIBS Institute for Artificial Intelligence, University of
> Leeds UK
> ------------------------------
> *From:* corpora-bounces at uib.no <corpora-bounces at uib.no> on behalf of
> Sajib Dasgupta <sdgnew at gmail.com>
> *Sent:* 31 August 2015 10:37
> *To:* corpora at uib.no
> *Subject:* [Corpora-List] Multi-clustering text datasets
> Dear All,
> We are proud to distribute the Multi-clustering datasets, a subset of which
> was introduced in the following paper:
> Mining Clustering Dimensions.
> Sajib Dasgupta and Vincent Ng.
> In the Proceedings of the International Conference on Machine Learning
> (ICML), 2010.
> While traditional work on text clustering has largely focused on grouping
> documents by topic, it is conceivable that a user may want to cluster
> documents along other dimensions, such as the author's mood, gender, age or
> sentiment. This is useful because users often have only a single clustering
> along one particular dimension in mind; knowing that there are
> 'alternative' ways to cluster the data may give them important insights
> that they would otherwise miss.
> Motivated in part by this observation, we take a multifaceted approach to
> document annotation: we annotate a set of documents across multiple
> dimensions, where each dimension represents a particular classification
> structure along which the document set can be meaningfully categorized.
> We use the annotations as a gold standard to evaluate an alternative (or
> multi-) clustering system, which seeks to organize, or cluster, a set of
> text documents along multiple dimensions.
> We host a variety of document collections in the repository (12 in total),
> including blogs, reviews, opinionated articles and political discussions,
> each of which is annotated along at least two dimensions.
> The datasets can be downloaded from here:
> http://www.hlt.utdallas.edu/~sajib/multi-clusterings.html
> Sajib