The high-activity parallel implementation of data preprocessing based on MapReduce

  • Authors:
  • Qing He; Qing Tan; Xudong Ma; Zhongzhi Shi

  • Affiliations:
  • The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Bei ...;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China and Graduate University of Chinese Academy of Sciences, Bei ...;The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

  • Venue:
  • RSKT'10 Proceedings of the 5th international conference on Rough set and knowledge technology
  • Year:
  • 2010

Abstract

Data preprocessing is an important and basic technique for data mining and machine learning. Because of the dramatic increase in the volume of information, traditional data preprocessing techniques are time-consuming and ill-suited to processing massive data sets. To tackle this problem, we present parallel data preprocessing techniques based on MapReduce, a programming model that makes parallelization easy to implement. This paper gives the implementation details of these techniques, including data integration, data cleaning, and data normalization, among others. The proposed parallel techniques can process large-scale data (up to terabytes) efficiently. Our experimental results show considerable speedup as the number of processors increases.
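
The abstract does not show how any of the preprocessing steps are expressed as map and reduce functions. As a rough illustration only, the sketch below shows one way min-max normalization could be phrased in that style: a first pass computes per-column minimum and maximum values, and a second, map-only pass rescales each record. Everything here is an assumption made for illustration, not the authors' implementation: `run_mapreduce` is a hypothetical in-process stand-in for a real MapReduce runtime, and the function names, record layout, and the choice of min-max scaling are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Tuple

Record = List[float]
Stats = Tuple[float, float]


def run_mapreduce(
    records: List[Record],
    mapper: Callable[[Record], Iterator[Tuple[int, Stats]]],
    reducer: Callable[[int, List[Stats]], Stats],
) -> Dict[int, Stats]:
    """Hypothetical single-process stand-in for a MapReduce runtime."""
    groups: Dict[int, List[Stats]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def minmax_mapper(record: Record) -> Iterator[Tuple[int, Stats]]:
    # Emit (column index, (value, value)) so min and max fold in one reducer.
    for col, value in enumerate(record):
        yield col, (value, value)


def minmax_reducer(key: int, values: List[Stats]) -> Stats:
    # Combine partial (min, max) pairs for one column.
    lows, highs = zip(*values)
    return min(lows), max(highs)


def normalize(records: List[Record]) -> List[Record]:
    # Pass 1: per-column statistics via map and reduce.
    stats = run_mapreduce(records, minmax_mapper, minmax_reducer)
    # Pass 2: map-only rescaling of every record into [0, 1].
    result = []
    for record in records:
        row = []
        for col, value in enumerate(record):
            low, high = stats[col]
            span = high - low
            # A constant column has zero span; map it to 0.0 by convention.
            row.append((value - low) / span if span else 0.0)
        result.append(row)
    return result
```

Because the reducer is associative and commutative over (min, max) pairs, the same function could also serve as a combiner on each node in a real cluster, which is one reason this kind of step is often considered a natural fit for the model.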