Unsupervised clustering of multidimensional distributions using earth mover distance

Authors:
David Applegate;Tamraparni Dasu;Shankar Krishnan;Simon Urbanek
Affiliations:
AT&T Labs - Research, Florham Park, NJ, USA;AT&T Labs - Research, Florham Park, NJ, USA;AT&T Labs - Research, Florham Park, NJ, USA;AT&T Labs - Research, Florham Park, NJ, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Abstract

Multidimensional distributions are often used in data mining to describe and summarize different features of large datasets. It is natural to look for distinct classes in such datasets by clustering the data. A common approach uses methods such as k-means clustering. However, k-means inherently relies on the Euclidean metric in the embedded space and does not account for additional topology underlying the distribution. In this paper, we propose using Earth Mover Distance (EMD) to compare multidimensional distributions. For an n-bin histogram, the EMD is based on a solution to the transportation problem with time complexity O(n^3 log n). To mitigate the high computational cost of EMD, we propose an approximation that reduces the cost to linear time. Given the large size of our dataset, a fast approximation is crucial for this application. Other notions of distance, such as the information-theoretic Kullback-Leibler divergence and the statistical χ² distance, account only for the correspondence between bins with the same index; they do not use information across bins and are sensitive to bin size. A cross-bin distance measure like EMD is not affected by binning differences and meaningfully matches the perceptual notion of "nearness". Our technique is simple, efficient, and practical for clustering distributions. We demonstrate the use of EMD on a real-world application: analyzing 411,550 anonymous mobility usage patterns, which are defined as distributions over a manifold. EMD allows us to represent inherent relationships in this space and enables us to successfully cluster even sparse signatures.
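
The contrast between bin-by-bin and cross-bin distances can be illustrated with a small sketch. It is not the approximation proposed in the paper, whose details the abstract does not give. It assumes the simplest possible setting: one-dimensional histograms over the same unit-spaced bins, where the EMD is known to reduce to the sum of absolute differences between the two cumulative distributions. The function names (normalize, emd_1d, chi2) and the toy histograms are hypothetical, and the χ² form used here is one common variant among several.

```python
from itertools import accumulate


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def emd_1d(p, q):
    """EMD between two 1D histograms on the same unit-spaced bins.

    Assumption: ground distance |i - j| between bins i and j. In this
    special case the EMD equals the L1 distance between the CDFs.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    cdf_p = accumulate(normalize(p))
    cdf_q = accumulate(normalize(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def chi2(p, q):
    """A bin-by-bin chi-square distance (one common variant)."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(p), normalize(q)
    # Bins that are empty in both histograms contribute nothing.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


# Toy histograms: all mass in a single bin.
base = [1, 0, 0, 0, 0]
near = [0, 1, 0, 0, 0]  # mass shifted by one bin
far = [0, 0, 0, 0, 1]   # mass shifted by four bins

print("EMD :", emd_1d(base, near), emd_1d(base, far))
print("chi2:", chi2(base, near), chi2(base, far))
```

In this toy case the χ² distance assigns the same value to both comparisons, because it only sees that the occupied bins differ. The EMD grows with how far the mass has to move, which is the sense of "nearness" the abstract refers to.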