Parallel K-Means Clustering Based on MapReduce

Authors:
Weizhong Zhao;Huifang Ma;Qing He
Affiliations:
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, and Graduate University of Chinese Academy of Sciences;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, and Graduate University of Chinese Academy of Sciences;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
Venue:
CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing
Year:
2009

Abstract

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation, and pattern classification. The growing volume of information produced by technological progress makes clustering very large datasets a challenging task. To address this problem, many researchers have tried to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.
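
The abstract does not describe how the algorithm is split into map and reduce steps, so the following is only a minimal single-process sketch of one common way to phrase a k-means iteration in that style, not the authors' implementation. It assumes the map step assigns each point to its nearest centroid and the reduce step averages the points in each group; the names `map_assign`, `reduce_mean`, and `kmeans_iteration` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point: Point, centroids: Sequence[Point]) -> Tuple[int, Point]:
    """Map step (assumed): emit (index of nearest centroid, point)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )
    return nearest, point


def reduce_mean(points: Sequence[Point]) -> Point:
    """Reduce step (assumed): average all points that share a centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points: Sequence[Point],
                     centroids: Sequence[Point]) -> List[Point]:
    """Run one map -> group-by-key -> reduce pass and return new centroids."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
```

In an actual MapReduce deployment the grouping by key would be done by the framework's shuffle phase, and the iteration would be repeated until the centroids stop changing; both of those aspects are outside this sketch.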