Parallel decision tree with application to water quality data analysis

Authors:
Qing He;Zhi Dong;Fuzhen Zhuang;Tianfeng Shang;Zhongzhi Shi
Affiliations:
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China,Graduate University of Chinese Academy of Sciences, China;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China,Graduate University of Chinese Academy of Sciences, China;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China
Venue:
ISNN'12 Proceedings of the 9th international conference on Advances in Neural Networks - Volume Part II
Year:
2012

Citing 8
Cited 0

Parallel Formulations of Decision-Tree Classification Algorithms

Data Mining and Knowledge Discovery
A Fast Parallel Clustering Algorithm for Large Spatial Databases

Data Mining and Knowledge Discovery
SPRINT: A Scalable Parallel Classifier for Data Mining

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Evaluating MapReduce for Multi-core and Multiprocessor Systems

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Google's MapReduce programming model – Revisited

Science of Computer Programming
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Parallel K-Means Clustering Based on MapReduce

CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Decision tree is a popular classification technique in many applications, such as retail target marketing, fraud detection and design of telecommunication service plans. With the information exploration, the existing classification algorithms are not good enough to tackle large data set. In order to deal with the problem, many researchers try to design efficient parallel classification algorithms. Based on the current and powerful parallel programming framework -- MapReduce, we propose a parallel ID3 classification algorithm(PID3 for short). We use water quality data monitoring the Changjiang River which contains 17 branches as experimental data. As the data are time series, we process the data to attribute data before using the decision tree. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.