Finding approximate solutions to combinatorial problems with very large data sets using BIRCH

  • Authors:
  • Justin Harrington; Matias Salibián-Barrera

  • Affiliations:
  • Department of Statistics, University of British Columbia, Canada V6T 1Z2 (both authors)

  • Venue:
  • Computational Statistics & Data Analysis
  • Year:
  • 2010

Abstract

Computing estimators with good robustness properties generally requires solving highly complex optimization problems. The current state-of-the-art algorithms for finding approximate solutions to these problems need to access the data set a large number of times and become infeasible when the data do not fit in memory. In this paper the BIRCH algorithm is adapted to compute approximate solutions to problems in this class. For data sets that fit in memory, this approach is able to find approximate Least Trimmed Squares (LTS) and Minimum Covariance Determinant (MCD) estimators that compare very well with those returned by the fast-LTS and fast-MCD algorithms, and in some cases it finds a better solution (in terms of the value of the objective function) than those fast algorithms. The methodology can also be applied to the Linear Grouping Algorithm and its robust variant for very large data sets. Finally, results from a simulation study indicate that this algorithm performs comparably to fast-LTS in simple situations (large data sets with a small number of covariates and a small proportion of outliers) and does much better than fast-LTS in more challenging situations, without requiring extra computational time. These findings seem to confirm that this approach provides the first computationally feasible and reliable approximate algorithm in the literature for computing the LTS and MCD estimators on data sets that do not fit in memory.
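
What makes such approximations feasible out of memory is BIRCH's compact subcluster summaries (clustering features), which can be merged without revisiting the raw observations. The sketch below is a minimal illustration of that idea only, not the paper's implementation: the helper names (make_cf, merge_cfs, mean_cov_from_cfs) and the crude MCD-flavoured concentration loop over subcluster centres are assumptions made for the example, and the paper's actual algorithm selects and refines subclusters more carefully.

```python
import numpy as np

# Illustrative sketch (not the paper's code): a BIRCH-style clustering
# feature (CF) stores, per subcluster, the count, the linear sum, and the
# sum of outer products.  Unions of subclusters can then be summarized by
# simple addition, so means and covariances of any candidate subset are
# available without rereading the raw data.

def make_cf(X):
    """Summarize a block of rows X (n_i x p) into a clustering feature."""
    return {"n": X.shape[0], "ls": X.sum(axis=0), "ss": X.T @ X}

def merge_cfs(cfs):
    """Combine clustering features by adding their components."""
    n = sum(cf["n"] for cf in cfs)
    ls = sum(cf["ls"] for cf in cfs)
    ss = sum(cf["ss"] for cf in cfs)
    return n, ls, ss

def mean_cov_from_cfs(cfs):
    """Mean and covariance of the union of subclusters, from summaries only."""
    n, ls, ss = merge_cfs(cfs)
    mu = ls / n
    cov = (ss - np.outer(ls, ls) / n) / (n - 1)
    return mu, cov

# Toy usage: an MCD-flavoured search that repeatedly keeps the half of the
# subclusters whose centres are closest (in Mahalanobis distance) to the
# current fit -- a concentration step applied to summaries, not raw rows.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(200, 3)) for _ in range(40)]
blocks += [rng.normal(loc=8.0, size=(200, 3)) for _ in range(10)]  # outlying blocks
cfs = [make_cf(B) for B in blocks]

keep = cfs[: len(cfs) // 2]
for _ in range(20):
    mu, cov = mean_cov_from_cfs(keep)
    inv = np.linalg.inv(cov)
    centres = [cf["ls"] / cf["n"] for cf in cfs]
    d = [float((c - mu) @ inv @ (c - mu)) for c in centres]
    keep = [cfs[i] for i in np.argsort(d)[: len(cfs) // 2]]

mu, cov = mean_cov_from_cfs(keep)
print("determinant of the concentrated covariance:", np.linalg.det(cov))
```

The point of the sketch is the design choice, not the loop itself: once each subcluster is reduced to a clustering feature, every candidate subset can be scored from sums alone, which is what allows LTS/MCD-style searches to run over data sets that never fit in memory at once.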