Accurate Recasting of Parameter Estimation Algorithms Using Sufficient Statistics for Efficient Parallel Speed-Up: Demonstrated for Center-Based Data Clustering Algorithms

Authors:
Bin Zhang;Meichun Hsu;George Forman
Venue:
PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2000


Abstract

Fueled by advances in computer technology and online business, data collection is rapidly accelerating, and with it the importance of its analysis, data mining. Increasing database sizes strain the scalability of many data mining algorithms. Data clustering is one of the fundamental techniques in data mining solutions. The many clustering algorithms that have been developed face new challenges as data sets grow. Algorithms with quadratic or higher computational complexity, such as agglomerative algorithms, drop out quickly. More efficient algorithms, such as K-Means and EM, which have linear cost per iteration, still need work to scale up to large data sets. This paper shows that many parameter estimation algorithms, including K-Means, K-Harmonic Means and EM, can be recast without approximation in terms of Sufficient Statistics, yielding superior speed-up efficiency. Estimates using today's workstations and local area network technology suggest efficient speed-up to several hundred computers, leading to effective scale-up for clustering hundreds of gigabytes of data. Parallel clustering has been implemented in a parallel programming language, ZPL. Experimental results show above 90% utilization.
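
As a rough illustration of the recasting described in the abstract, the sketch below shows one K-Means iteration written in terms of per-partition sufficient statistics (cluster counts and coordinate sums). It is not the authors' ZPL implementation: it is plain Python, covers K-Means only, and the names `local_stats`, `merge_stats`, `update_centers` and `kmeans_step` are hypothetical. The partitions are processed in a sequential loop here, standing in for separate machines that would each send back only their small statistics.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Per-cluster point counts and per-cluster coordinate sums.
Stats = Tuple[List[int], List[List[float]]]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_stats(partition: Sequence[Point], centers: Sequence[Point]) -> Stats:
    """Sufficient statistics of one data partition for the current centers."""
    dim = len(centers[0])
    counts = [0] * len(centers)
    sums = [[0.0] * dim for _ in centers]
    for point in partition:
        k = nearest(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return counts, sums


def merge_stats(all_stats: Sequence[Stats]) -> Stats:
    """Add the per-partition statistics; the only step that crosses partitions."""
    counts = [sum(col) for col in zip(*(s[0] for s in all_stats))]
    sums = [
        [sum(vals) for vals in zip(*rows)]
        for rows in zip(*(s[1] for s in all_stats))
    ]
    return counts, sums


def update_centers(stats: Stats, old_centers: Sequence[Point]) -> List[List[float]]:
    """New center = coordinate sum / count; an empty cluster keeps its old center."""
    counts, sums = stats
    return [
        [s / n for s in row] if n > 0 else list(old)
        for n, row, old in zip(counts, sums, old_centers)
    ]


def kmeans_step(partitions: Sequence[Sequence[Point]],
                centers: Sequence[Point]) -> List[List[float]]:
    """One K-Means iteration over partitioned data."""
    stats = merge_stats([local_stats(p, centers) for p in partitions])
    return update_centers(stats, centers)
```

Because the merged counts and sums equal those computed over the whole data set, calling `kmeans_step` on any partitioning gives the same centers as calling it on a single partition, up to floating-point rounding from the changed order of addition. The data exchanged per iteration depends on the number of clusters and dimensions, not on the number of points.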