Simultaneous Unsupervised Learning of Disparate Clusterings

Authors:
Prateek Jain;Raghu Meka;Inderjit S. Dhillon
Affiliations:
Department of Computer Sciences, University of Texas, Austin, TX 78712-1188, USA;Department of Computer Sciences, University of Texas, Austin, TX 78712-1188, USA;Department of Computer Sciences, University of Texas, Austin, TX 78712-1188, USA
Venue:
Statistical Analysis and Data Mining
Year:
2008

Citing 0
Cited 6

Learning non-redundant codebooks for classifying complex objects

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Unifying dependent clustering and disparate clustering for non-homogeneous data

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
An extension of the PMML standard to subspace clustering models

Proceedings of the 2011 workshop on Predictive markup language modeling
Model-based multidimensional clustering of categorical data

Artificial Intelligence
Subspace clustering

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Multi-view clustering using mixture models in subspace projections

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most clustering algorithms produce a single clustering for a given dataset even when the data can be clustered naturally in multiple ways. In this paper, we address the difficult problem of uncovering disparate clusterings from the data in a totally unsupervised manner. We propose two new approaches for this problem. In the first approach, we aim to find good clusterings of the data that are also decorrelated with one another. To this end, we give a new and tractable characterization of decorrelation between clusterings, and present an objective function to capture it. We provide an iterative “decorrelated” k-means type algorithm to minimize this objective function. In the second approach, we model the data as a sum of mixtures and associate each mixture with a clustering. This approach leads us to the problem of learning a convolution of mixture distributions. Though the latter problem can be formulated as one of factorial learning 8, 13, 16, the existing formulations and methods do not perform well on many real high-dimensional datasets. We propose a new regularized factorial-learning framework that is more suitable for capturing the notion of disparate clusterings in modern, high-dimensional datasets. Furthermore, we provide kernelized version of both of our algorithms. The resulting algorithms do well in uncovering multiple clusterings, and are much improved over existing methods. We evaluate our methods on two real-world datasets—a music dataset from the text-mining domain, and a portrait dataset from the computer-vision domain. Our methods achieve a substantially higher accuracy than existing factorial learning as well as traditional clustering algorithms. Copyright © 2008 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 000-000, 2008