Assignment methods are at the heart of many algorithms for unsupervised learning and clustering, most notably the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences in behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are and how evenly the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. Beyond letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition supports a rather general argument that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method, which we call posterior assignment; it is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.
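
To make the three assignment rules concrete, the following is a minimal Python/NumPy sketch, not the authors' code: the two-Gaussian sample, the fixed unit variance, the fixed mixing weights, and all constants are illustrative assumptions. It runs hard (K-means-style), soft (EM-style), and posterior (sampled) assignment on one-dimensional data, then reports the two quantities the decomposition trades off: the within-cluster distortion and the entropy of the hard partition induced by the final means.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two overlapping one-dimensional Gaussian clusters.
x = np.concatenate([rng.normal(-1.0, 1.0, 300), rng.normal(1.0, 1.0, 300)])

def posteriors(x, means, sigma=1.0, weights=(0.5, 0.5)):
    # P(cluster j | x) for a spherical Gaussian mixture with known sigma.
    mu = np.asarray(means)
    log_p = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2 + np.log(weights)
    log_p -= log_p.max(axis=1, keepdims=True)  # subtract max for stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def update(x, means, method):
    p = posteriors(x, means)
    if method == "hard":         # K-means style: each point to its MAP cluster
        w = np.eye(2)[p.argmax(axis=1)]
    elif method == "soft":       # EM style: fractional responsibilities
        w = p
    elif method == "posterior":  # sample one assignment from the posterior
        z = (rng.random(len(x)) > p[:, 0]).astype(int)
        w = np.eye(2)[z]
    else:
        raise ValueError(method)
    # Re-estimate each mean as the (weighted) average of its assigned data.
    return w.T @ x / (w.sum(axis=0) + 1e-12), w

for method in ("hard", "soft", "posterior"):
    means = np.array([-0.1, 0.1])  # deliberately poor, nearly collapsed start
    for _ in range(50):
        means, w = update(x, means, method)
    # For the hard partition induced by the final parameters, measure the two
    # sides of the trade-off: within-cluster distortion and partition entropy.
    z = posteriors(x, means).argmax(axis=1)
    distortion = np.mean((x - means[z]) ** 2)
    freq = np.bincount(z, minlength=2) / len(x)
    entropy = -np.sum(freq[freq > 0] * np.log2(freq[freq > 0]))
    print(f"{method:9s} means={np.sort(means).round(2)} "
          f"distortion={distortion:.3f} partition-entropy={entropy:.3f} bits")

On data like this, and consistent with the abstract's claim, the hard variant typically drives the two fitted means farther apart (less overlap between the resulting densities) than the soft variant; posterior assignment, despite drawing from the same posteriors that EM averages over, keeps resampling assignments and so, in this sketch, fluctuates around a solution rather than converging to a fixed point.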