Scalable Clustering Algorithms with Balancing Constraints

Authors:
Arindam Banerjee; Joydeep Ghosh
Affiliations:
Department of Computer Science and Engineering, University of Minnesota, Twin Cities, USA 55455; Department of Electrical and Computer Engineering, College of Engineering, University of Texas at Austin, Austin, USA 78712
Venue:
Data Mining and Knowledge Discovery
Year:
2006

Abstract

Clustering methods for data mining problems must be extremely scalable. In addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling a small representative subset of the points, clustering the sampled data, and populating the initial clusters with the remaining data, followed by refinements. First, we show that simple uniform sampling from the original data is sufficient to obtain a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the proposed framework yields clustering performance that is comparable to, and often better than, that of the corresponding unconstrained solution. Experimental results on several datasets, including high-dimensional (20,000) ones, are provided to demonstrate the efficacy of the proposed framework.
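
The sample, cluster, and populate pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm. Everything in it is an assumption made for illustration: the function names (`kmeans`, `balanced_populate`, `balanced_cluster`), the use of plain Lloyd's k-means on the sample, and a greedy capacity-constrained assignment that stands in for the paper's stable-marriage-based populate step. The refinement step is omitted, and for simplicity all points (including the sampled ones) are assigned to the centroids learned from the sample.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's k-means on the (small) sample; returns k centroids."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        # Keep the previous centroid if a cluster received no points.
        centroids = [mean(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids


def balanced_populate(points, centroids):
    """Greedy stand-in for the populate step (an assumption, not the
    paper's stable-marriage generalization): visit all (point, cluster)
    pairs in order of increasing distance and accept a pair only if the
    point is unassigned and the cluster is below capacity ceil(N / k)."""
    n, k = len(points), len(centroids)
    capacity = math.ceil(n / k)
    pairs = sorted(
        (sq_dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )  # sorting kN pairs dominates the cost of this step
    labels = [None] * n
    sizes = [0] * k
    for _, i, j in pairs:
        if labels[i] is None and sizes[j] < capacity:
            labels[i] = j
            sizes[j] += 1
    return labels


def balanced_cluster(points, k, sample_size, seed=0):
    """Step 1: uniform sample. Step 2: cluster the sample.
    Step 3: populate clusters under a size cap (refinement omitted)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centroids = kmeans(sample, k, rng=rng)
    return balanced_populate(points, centroids)
```

Calling `balanced_cluster(points, k=4, sample_size=20)` on a list of 100 coordinate tuples returns one label per point, and because each cluster is capped at ceil(100 / 4) = 25 points, every cluster ends up with exactly 25 members even when the data are heavily skewed. The greedy pass is chosen here only because it is short and easy to verify; it gives the size guarantee but makes no claim to match the assignment quality of the method in the paper.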