A fast and recursive algorithm for clustering large datasets with k-medians

Authors:
Hervé Cardot;Peggy Cénac;Jean-Marie Monnez
Affiliations:
Institut de Mathématiques de Bourgogne, UMR 5584, Université de Bourgogne, 9 Avenue Alain Savary, 21078 Dijon, France;Institut de Mathématiques de Bourgogne, UMR 5584, Université de Bourgogne, 9 Avenue Alain Savary, 21078 Dijon, France;Institut Elie Cartan, UMR 7502, Nancy Université, CNRS, INRIA, B.P. 239-F 54506 Vandoeuvre lès Nancy Cedex, France
Venue:
Computational Statistics & Data Analysis
Year:
2012

Abstract

Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. A new class of recursive stochastic gradient algorithms designed for the k-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and well suited to large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which are known to perform better. A data-driven procedure that permits fully automatic selection of the descent step value is also proposed. The performance of the averaged sequential estimator is compared in a simulation study, in terms of both computation speed and estimation accuracy, with more classical partitioning techniques such as k-means, trimmed k-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated by determining television audience profiles from a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.
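The abstract does not state the update rule, so the following is only a minimal sketch of what an averaged stochastic gradient step for a k-medians criterion could look like. It assumes the criterion is the expected Euclidean distance from an observation to its nearest center, so each new observation moves only its nearest center, by a fixed-length step in the direction of the observation. The function name `online_k_medians`, the step schedule `step / n**alpha`, and the per-center running average are illustrative assumptions, not the authors' specification; the data-driven choice of the descent step mentioned in the abstract is not reproduced, and `step` is set by hand.

```python
import math
import random


def online_k_medians(stream, init_centers, step=1.0, alpha=0.75):
    """One pass of an averaged stochastic gradient k-medians sketch.

    `step` and `alpha` define an assumed step schedule step / n**alpha,
    where n counts the observations assigned so far to the updated center.
    """
    centers = [list(c) for c in init_centers]    # current gradient iterates
    averaged = [list(c) for c in init_centers]   # running averages of the iterates
    counts = [0] * len(centers)                  # observations assigned per center

    for x in stream:
        # Assign the new observation to its nearest current center.
        dists = [math.dist(x, c) for c in centers]
        r = min(range(len(centers)), key=dists.__getitem__)
        counts[r] += 1
        if dists[r] > 0.0:
            # The gradient of ||x - c|| in c is -(x - c) / ||x - c||, so the
            # descent step moves c toward x by a length equal to the step size.
            a = step / counts[r] ** alpha
            centers[r] = [c + a * (xi - c) / dists[r]
                          for c, xi in zip(centers[r], x)]
        # Update the running mean of center r over its own iterates.
        averaged[r] = [m + (c - m) / counts[r]
                       for m, c in zip(averaged[r], centers[r])]
    return centers, averaged


# Hypothetical usage: two Gaussian groups arriving one observation at a time.
random.seed(0)
data = [(random.gauss(m, 0.5), random.gauss(m, 0.5))
        for _ in range(2000) for m in (0.0, 5.0)]
centers, averaged = online_k_medians(data, [(1.0, 1.0), (4.0, 4.0)])
print(averaged)
```

Each observation is processed once and then discarded, which is what makes a recursive scheme of this kind suitable for data that arrive sequentially. The averaged centers smooth out the oscillations of the raw iterates around the stationary point.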