P-AutoClass: Scalable Parallel Clustering for Mining Large Data Sets

Authors:
Clara Pizzuti;Domenico Talia
Affiliations:
-;IEEE Computer Society
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2003

Abstract

Data clustering is an important task in data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items so that, according to some defined criteria, items in the same cluster are more similar to each other than to items in other clusters. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. One way to reduce the processing time is to implement clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system, which uses a Bayesian model to determine optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition of the data and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results for different numbers of processors and different data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass relative to the sequential AutoClass system are evaluated and compared.
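The partition-and-exchange scheme described in the abstract can be illustrated with a small sketch. This is not the P-AutoClass code: it assumes a one-dimensional Gaussian mixture with an EM-style update as a stand-in for the Bayesian class model, and it simulates the processors with an ordinary loop rather than a real multicomputer. All function names (`local_stats`, `reduce_stats`, `em_step`, `speedup`, `efficiency`) are hypothetical and chosen for illustration only.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def local_stats(part, params):
    """Work done by one (simulated) processor on its own partition:
    accumulate per-class weighted statistics (sum w, sum w*x, sum w*x^2)."""
    stats = [[0.0, 0.0, 0.0] for _ in params]
    for x in part:
        likes = [pi * gaussian_pdf(x, mu, var) for (pi, mu, var) in params]
        total = sum(likes)
        for j, like in enumerate(likes):
            w = like / total  # membership weight of item x in class j
            stats[j][0] += w
            stats[j][1] += w * x
            stats[j][2] += w * x * x
    return stats

def reduce_stats(all_stats):
    """Element-wise sum of the partial results. On a real machine this
    would be a global reduction among processors (assumption)."""
    k = len(all_stats[0])
    return [[sum(s[j][i] for s in all_stats) for i in range(3)] for j in range(k)]

def update_params(stats, n):
    """Recompute (weight, mean, variance) per class from global statistics."""
    params = []
    for w, wx, wxx in stats:
        mean = wx / w
        var = max(wxx / w - mean * mean, 1e-6)  # floor avoids a zero variance
        params.append((w / n, mean, var))
    return params

def em_step(data, params, n_workers):
    """One iteration: split the data, compute local statistics per
    partition, combine them, and update the class parameters."""
    parts = [data[i::n_workers] for i in range(n_workers)]
    stats = reduce_stats([local_stats(p, params) for p in parts])
    return update_params(stats, len(data))

def speedup(t_seq, t_par):
    """Ratio of sequential to parallel running time."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Speedup divided by the number of processors."""
    return speedup(t_seq, t_par) / p

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 1.1, 4.9]
    start = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]
    print(em_step(data, start, 1))
    print(em_step(data, start, 4))
```

Because the per-class statistics are plain sums, the partitioned run yields the same parameters as the single-partition run up to floating-point rounding, and only a few numbers per class need to be exchanged rather than the data itself. The `speedup` and `efficiency` helpers use the standard definitions; the paper's own performance model may include additional terms such as communication overhead.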