A Data-Clustering Algorithm on Distributed Memory Multiprocessors

Authors:
Inderjit S. Dhillon; Dharmendra S. Modha
Venue:
Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD
Year:
1999

Abstract

To cluster the increasingly massive data sets that are now common in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message-passing model. The proposed algorithm exploits the inherent data parallelism in the k-means algorithm. We show analytically that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups (for example, 15.62 on 16 nodes) and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2-gigabyte test data set, our implementation drives the 16-node SP2 at more than 1.8 gigaflops.
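The data parallelism the abstract refers to can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each process holds one block of the points, computes per-cluster partial sums and counts against a shared set of centroids, and that a global sum reduction (in an MPI program this role could be played by a call such as MPI_Allreduce) combines the partial results so every process derives the same new centroids. The "processes" are simulated here with a plain Python loop, and the names `local_step`, `all_reduce`, and `parallel_kmeans` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Partial = Tuple[List[List[float]], List[int], float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_step(block: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Work of one (simulated) process: assign its own points to the
    nearest centroid and accumulate per-cluster sums, counts, and error."""
    k, d = len(centroids), len(centroids[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    sse = 0.0
    for p in block:
        j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
        sse += sq_dist(p, centroids[j])
    return sums, counts, sse


def all_reduce(partials: Sequence[Partial]) -> Partial:
    """Stand-in for a global sum reduction across processes (assumption:
    a real message-passing program would use a collective operation)."""
    k, d = len(partials[0][0]), len(partials[0][0][0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    sse = 0.0
    for part_sums, part_counts, part_sse in partials:
        for j in range(k):
            counts[j] += part_counts[j]
            for t in range(d):
                sums[j][t] += part_sums[j][t]
        sse += part_sse
    return sums, counts, sse


def parallel_kmeans(blocks: Sequence[Sequence[Point]],
                    centroids: Sequence[Point],
                    max_iter: int = 100,
                    tol: float = 1e-9) -> List[List[float]]:
    """Iterate local steps plus one reduction until the error stops improving."""
    current = [list(map(float, c)) for c in centroids]
    prev_sse = float("inf")
    for _ in range(max_iter):
        partials = [local_step(b, current) for b in blocks]  # one per process
        sums, counts, sse = all_reduce(partials)
        for j, n in enumerate(counts):
            if n > 0:  # keep the old centroid if a cluster is empty
                current[j] = [s / n for s in sums[j]]
        if prev_sse - sse <= tol:
            break
        prev_sse = sse
    return current


if __name__ == "__main__":
    # Four points split across two simulated processes.
    blocks = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
    print(parallel_kmeans(blocks, [(0.0, 0.0), (10.0, 0.0)]))
```

In this sketch the only data exchanged per iteration is k centroid sums, k counts, and one error value, independent of the number of points. That is consistent with the abstract's claim that speedup approaches the optimum as the data set grows, although the paper's actual analysis is not reproduced here.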