Data cloud for distributed data mining via pipelined MapReduce

Authors:
Zhiang Wu;Jie Cao;Changjian Fang
Affiliations:
Jiangsu Provincial Key Laboratory of E-Business, Nanjing University of Finance and Economics, Nanjing, P.R. China (all authors)
Venue:
ADMI'11 Proceedings of the 7th international conference on Agents and Data Mining Interaction
Year:
2011



Abstract

Distributed data mining (DDM), which often relies on autonomous agents, is the process of extracting globally interesting associations, classifiers, clusters, and other patterns from distributed data. As datasets double in size every year, repeatedly moving the data to distant CPUs incurs a high communication cost. In this paper, a data cloud is used to implement DDM so that the computation is moved to the data rather than the data to the computation. MapReduce is a popular programming model for data-centric distributed computing. First, a cloud system architecture for DDM is proposed. Second, a modified MapReduce framework, called pipelined MapReduce, is presented. We select Apriori as a case study and discuss its implementation within the MapReduce framework. Finally, several experiments are conducted. The results show that, with a moderate number of map tasks, the execution time of DDM algorithms (here, Apriori) can be reduced remarkably. A performance comparison between traditional and pipelined MapReduce shows that map tasks and reduce tasks in pipelined MapReduce can run in parallel, and that pipelined MapReduce greatly decreases the execution time of the DDM algorithm. The data cloud is suitable for a multitude of DDM algorithms and can provide significant speedups.
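As an illustration of the idea only, and not the authors' implementation, the following minimal single-process sketch shows how Apriori's support counting can be phrased as map and reduce steps, with the map output streamed into the reducer through a generator as a loose analogy for pipelining. The function names (`map_count`, `reduce_count`, `apriori`), the toy transactions, and the in-memory "splits" are all assumptions made for this example; a real data cloud would run the tasks on separate nodes.

```python
from collections import Counter
from itertools import combinations


def map_count(split, k, candidates=None):
    """Map task (hypothetical): emit (itemset, 1) for each candidate
    k-itemset contained in each transaction of one data split."""
    for transaction in split:
        for itemset in combinations(sorted(set(transaction)), k):
            if candidates is None or itemset in candidates:
                yield itemset, 1


def reduce_count(pairs, min_support):
    """Reduce task (hypothetical): sum the counts per itemset and keep
    the itemsets whose support reaches min_support."""
    totals = Counter()
    for itemset, count in pairs:  # pairs are consumed as they are produced
        totals[itemset] += count
    return {s: c for s, c in totals.items() if c >= min_support}


def apriori(splits, min_support):
    """Level-wise Apriori: one map/reduce round per itemset size k."""
    frequent, candidates, k = {}, None, 1
    while True:
        # The generator feeds map output straight to the reducer without
        # materializing it first (a stand-in for pipelining, by assumption).
        stream = (p for split in splits for p in map_count(split, k, candidates))
        level = reduce_count(stream, min_support)
        if not level:
            break
        frequent.update(level)
        items = sorted({item for itemset in level for item in itemset})
        k += 1
        # Keep a k-itemset only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in combinations(items, k)
            if all(sub in level for sub in combinations(c, k - 1))
        }
        if not candidates:
            break
    return frequent


# Toy data: two splits, as if stored on two nodes.
splits = [
    [["bread", "milk"], ["bread", "beer", "milk"]],
    [["milk", "beer"], ["bread", "milk", "beer"]],
]
result = apriori(splits, min_support=3)
```

With this toy input, every single item is frequent, the pairs (beer, milk) and (bread, milk) reach the support threshold of 3, and the only possible triple is pruned because (beer, bread) is infrequent.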