Effective and Efficient Distributed Model-Based Clustering

Authors:
Hans-Peter Kriegel;Peer Kröger;Alexey Pryakhin;Matthias Schubert
Affiliations:
University of Munich;University of Munich;University of Munich;University of Munich
Venue:
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Year:
2005

Abstract

In many companies, data is distributed among several sites, i.e., each site generates its own data and manages its own data repository. Analyzing and mining these distributed sources requires distributed data mining techniques to find global patterns representing the complete information. Transmitting the entire local data set is often unacceptable because of performance considerations, privacy and security aspects, and bandwidth constraints. Traditional data mining algorithms, which demand access to the complete data, are not appropriate for distributed applications. Thus, there is a need for distributed data mining algorithms to analyze and discover new knowledge in distributed environments. One of the most important data mining tasks is clustering, which aims at detecting groups of similar data objects. In this paper, we propose a distributed model-based clustering algorithm that uses EM to detect local models in terms of mixtures of Gaussian distributions. We propose an efficient and effective algorithm for deriving and merging these local Gaussian distributions to generate a meaningful global model. In a broad experimental evaluation, we show that our framework is scalable in a highly distributed environment.
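The general idea in the abstract (fit a Gaussian mixture locally with EM, transmit only the model parameters, merge them into a global model) can be illustrated with a minimal one-dimensional sketch. The abstract does not state how the paper derives or merges the local Gaussians, so the merging step here is an assumption: components whose means lie within a distance threshold are combined by moment matching. The function names `em_1d`, `merge_two`, and `merge_local_models`, the `threshold` parameter, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import math
import random

def em_1d(data, k, iters=50):
    """Fit a 1-D Gaussian mixture with plain EM; returns (weight, mean, var) triples."""
    srt = sorted(data)
    n = len(srt)
    # Deterministic initialisation: means at evenly spaced quantiles, shared variance.
    means = [srt[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mu_all = sum(srt) / n
    var_all = max(sum((x - mu_all) ** 2 for x in srt) / n, 1e-6)
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)
            weights[j] = nj / n
    return list(zip(weights, means, variances))

def merge_two(a, b):
    """Moment-matching merge of two Gaussians given as (mass, mean, var)."""
    na, ma, va = a
    nb, mb, vb = b
    n = na + nb
    m = (na * ma + nb * mb) / n
    v = (na * (va + (ma - m) ** 2) + nb * (vb + (mb - m) ** 2)) / n
    return (n, m, v)

def merge_local_models(local_models, threshold=1.0):
    """Merge local mixtures into one global mixture (assumed merging rule).

    local_models: list of (site_size, [(weight, mean, var), ...]).
    Components whose means differ by less than `threshold` pooled standard
    deviations are merged greedily until no such pair remains.
    """
    # Convert relative weights to absolute masses so large sites count more.
    comps = [(size * w, m, v) for size, model in local_models for w, m, v in model]
    merged = True
    while merged:
        merged = False
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                a, b = comps[i], comps[j]
                dist = abs(a[1] - b[1]) / math.sqrt((a[2] + b[2]) / 2)
                if dist < threshold:
                    comps[i] = merge_two(a, b)
                    del comps[j]
                    merged = True
                    break
            if merged:
                break
    total = sum(c[0] for c in comps)
    return sorted(((c[0] / total, c[1], c[2]) for c in comps), key=lambda c: c[1])

# Three hypothetical sites, each holding its own sample of the same two clusters.
random.seed(0)
sites = []
for size in (300, 400, 500):
    data = ([random.gauss(0.0, 1.0) for _ in range(size // 2)]
            + [random.gauss(10.0, 1.0) for _ in range(size // 2)])
    # Only the site size and the fitted parameters leave the site, not the raw data.
    sites.append((len(data), em_1d(data, k=2)))

global_model = merge_local_models(sites)
for weight, mean, var in global_model:
    print(f"weight={weight:.2f} mean={mean:.2f} var={var:.2f}")
```

In this sketch each site sends only a handful of numbers per component, which is the property the abstract motivates with bandwidth and privacy constraints. A multivariate version would replace the scalar variance with a covariance matrix and would need a different distance between components.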