The high-activity parallel implementation of data preprocessing based on MapReduce

Authors:
Qing He;Qing Tan;Xudong Ma;Zhongzhi Shi
Affiliations:
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Bei ...;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Bei ...;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Venue:
RSKT'10 Proceedings of the 5th international conference on Rough set and knowledge technology
Year:
2010

Abstract

Data preprocessing is an important and basic technique for data mining and machine learning. Because the volume of information has grown dramatically, traditional data preprocessing techniques are time-consuming and ill-suited to processing massive data sets. To tackle this problem, we present parallel data preprocessing techniques based on MapReduce, a programming model that makes parallelization easy to implement. This paper gives the implementation details of these techniques, including data integration, data cleaning, and data normalization. The proposed parallel techniques can handle large-scale data (up to terabytes) efficiently. Our experimental results show considerable speedup as the number of processors increases.
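
The abstract does not reproduce the paper's algorithms, so the following is only an illustrative sketch of how one of the named techniques, data normalization, might be phrased as MapReduce steps. It assumes min-max scaling as the normalization method and simulates the map, shuffle, and reduce phases in a single Python process rather than on a real cluster. All function names (map_min_max, reduce_min_max, run_min_max_job, map_normalize, normalize) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Tuple

Record = List[float]          # one numeric row of the data set
MinMax = Tuple[float, float]  # (minimum, maximum) of one column


def map_min_max(record: Record) -> Iterator[Tuple[int, MinMax]]:
    """Map step (hypothetical): emit (column index, (value, value)) per field."""
    for col, value in enumerate(record):
        yield col, (value, value)


def reduce_min_max(pairs: List[MinMax]) -> MinMax:
    """Reduce step (hypothetical): fold all pairs of one column into (min, max)."""
    lo: Optional[float] = None
    hi: Optional[float] = None
    for p_lo, p_hi in pairs:
        lo = p_lo if lo is None else min(lo, p_lo)
        hi = p_hi if hi is None else max(hi, p_hi)
    assert lo is not None and hi is not None, "reducer received no values"
    return lo, hi


def run_min_max_job(records: List[Record]) -> Dict[int, MinMax]:
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    grouped: Dict[int, List[MinMax]] = defaultdict(list)
    for record in records:
        for key, value in map_min_max(record):
            grouped[key].append(value)
    return {key: reduce_min_max(values) for key, values in grouped.items()}


def map_normalize(record: Record, stats: Dict[int, MinMax]) -> Record:
    """Map-only second pass: rescale each field to [0, 1] using column stats."""
    out: Record = []
    for col, value in enumerate(record):
        lo, hi = stats[col]
        # A constant column has no range; map it to 0.0 by convention (assumption).
        out.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return out


def normalize(records: List[Record]) -> List[Record]:
    """Two passes: compute per-column statistics, then rescale every record."""
    stats = run_min_max_job(records)
    return [map_normalize(record, stats) for record in records]


if __name__ == "__main__":
    data = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
    print(normalize(data))
```

The two-pass structure reflects a common design choice for this kind of job: the first pass needs a reduce step because column statistics depend on every record, while the second pass is map-only because each record can be rescaled independently once the statistics are known.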