Traditional clustering methods assume that there is no measurement error, or uncertainty, associated with data. Often, however, real-world applications require treatment of data that have such errors. In the presence of measurement errors, well-known clustering methods like k-means and hierarchical clustering may not produce satisfactory results. In this article, we develop a statistical model and algorithms for clustering data in the presence of errors. We assume that the errors associated with data follow a multivariate Gaussian distribution and are independent across data points. The model uses the maximum likelihood principle and provides us with a new metric for clustering. This metric is used to develop two algorithms for error-based clustering, hError and kError, which are generalizations of Ward's hierarchical and k-means clustering algorithms, respectively. We discuss types of clustering problems where error information associated with the data to be clustered is readily available and where error-based clustering is likely to be superior to clustering methods that ignore error. We focus on clustering derived data (typically parameter estimates) obtained by fitting statistical models to the observed data. We show that, for Gaussian-distributed observed data, the optimal error-based clusters of derived data are the same as the maximum likelihood clusters of the observed data. We also report briefly on two applications with real-world data and a series of simulation studies using four statistical models: (1) sample averaging, (2) multiple linear regression, (3) ARIMA models for time series, and (4) Markov chains, where error-based clustering performed significantly better than traditional clustering methods.
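To make the abstract's idea concrete, the sketch below shows one plausible form of an error-based metric under the stated assumptions (independent multivariate Gaussian errors): a Mahalanobis-type distance that weights each coordinate by a point's own error covariance, and a maximum-likelihood cluster center computed as the inverse-covariance-weighted mean. This is an illustrative sketch, not the paper's exact hError/kError formulation; the function names `error_weighted_center` and `error_distance` are hypothetical.

```python
import numpy as np

def error_weighted_center(points, covs):
    """Maximum-likelihood center of a cluster under independent Gaussian
    errors: the inverse-error-covariance-weighted mean of its points.
    (Illustrative sketch; assumes each point has a known covariance.)"""
    inv = [np.linalg.inv(S) for S in covs]        # per-point precision matrices
    total = np.sum(inv, axis=0)                   # sum of precisions
    weighted = np.sum([Si @ x for Si, x in zip(inv, points)], axis=0)
    return np.linalg.solve(total, weighted)       # (sum S_i^-1)^-1 sum S_i^-1 x_i

def error_distance(x, cov, center):
    """Mahalanobis-type distance of a point, with its own error
    covariance, to a cluster center."""
    d = x - center
    return float(d @ np.linalg.solve(cov, d))     # d' cov^-1 d

# When every error covariance is the identity, the center reduces to the
# plain mean and the distance to squared Euclidean distance, so a
# kError-style assignment step degenerates to ordinary k-means.
pts = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
c = error_weighted_center(pts, covs)              # -> [1., 1.]
```

The key design point mirrored here is that points with large measurement error pull the center less (small precision) and sit "closer" to any center (large covariance shrinks the distance), which is why error-based clusters can differ from the k-means clusters of the same data.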