A statistical view of clustering performance through the theory of U-processes

Authors:
Stéphan Clémençon
Venue:
Journal of Multivariate Analysis
Year:
2014




Abstract

Many clustering techniques aim to optimize empirical criteria that take the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within-cluster point scatter over a class of partitions of the feature space. The purpose of this paper is to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk of the empirical minimizer is proved to be of the order $O_{\mathbb{P}}(1/\sqrt{n})$. A lower bound result shows that this rate is optimal in a minimax sense. Based on recent results on the tail behavior of degenerate U-processes, it is also shown how to establish tighter, and even faster, rate bounds under additional assumptions. Model selection issues, in particular those related to the number of clusters forming the data partition, are also considered. Finally, it is explained how the theoretical results developed here can provide statistical guarantees for empirical clustering aggregation.
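
To make the criterion in the abstract concrete, the following is a minimal sketch of an empirical within-cluster point scatter written as a U-statistic of degree two: the average, over all pairs of observations, of the dissimilarity between the two points when both fall in the same cell of the partition. The normalization 2/(n(n-1)), the function names, the squared Euclidean dissimilarity, and the toy data are assumptions made for illustration; the abstract does not fix any of them.

```python
from itertools import combinations


def within_cluster_scatter(points, labels, dissimilarity):
    """Empirical within-cluster point scatter as a degree-two U-statistic.

    Averages, over all unordered pairs (i, j), the dissimilarity between
    points i and j when both carry the same cluster label, and zero
    otherwise. The 2 / (n (n - 1)) normalization is an assumption.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two observations are required")
    if len(labels) != n:
        raise ValueError("points and labels must have the same length")
    total = 0.0
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:  # the pair lies in the same cell
            total += dissimilarity(points[i], points[j])
    return 2.0 * total / (n * (n - 1))


def squared_euclidean(x, y):
    """Illustrative dissimilarity measure (an assumption, not prescribed)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_labels = [0, 0, 1, 1]  # partition that respects the two groups
bad_labels = [0, 1, 0, 1]   # partition that mixes the two groups

good_risk = within_cluster_scatter(points, good_labels, squared_euclidean)
bad_risk = within_cluster_scatter(points, bad_labels, squared_euclidean)
print(good_risk, bad_risk)
```

In this toy case the partition that respects the two groups yields the smaller scatter, which is the quantity an empirical minimizer over a class of candidate partitions would seek to reduce.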