iMapReduce: A Distributed Computing Framework for Iterative Computation

  • Authors:
  • Yanfeng Zhang; Qixin Gao; Lixin Gao; Cuirong Wang

  • Affiliations:
  • School of Information Science and Engineering, Northeastern University, Shenyang, China 110819; Department of Electrical and Information Engineering, Northeastern University at Qinhuangdao, Qinhuangdao, China 066000; Department of Electrical and Computer Engineering, University of Massachusetts Amherst, Amherst, USA 01002; Department of Electrical and Information Engineering, Northeastern University at Qinhuangdao, Qinhuangdao, China 066000

  • Venue:
  • Journal of Grid Computing
  • Year:
  • 2012

Abstract

Iterative computation is pervasive in applications such as data mining, web ranking, graph analysis, and online social network analysis. These applications typically involve massive data sets containing millions or billions of records, which creates demand for distributed computing frameworks that can process such data sets on a cluster of machines. MapReduce is one such framework. However, MapReduce lacks built-in support for iterative processing, in which the same data sets must be parsed repeatedly. Besides specifying the MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify an iterative computation with separate map and reduce functions, and it runs the iterations automatically within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5x speedup over Hadoop for iterative algorithms.
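
The driver-program pattern that the abstract describes for plain MapReduce can be illustrated with a small in-memory sketch. This is not the iMapReduce API or Hadoop code: `run_job`, `drive`, the three-node `GRAPH`, and the PageRank-style update are all hypothetical names and data chosen for illustration. The sketch shows the two costs the abstract mentions: a new job is launched on every iteration, and the static link lists are passed through the shuffle each time alongside the changing rank values.

```python
from collections import defaultdict

# Hypothetical toy graph (static data): node -> outgoing links.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
DAMPING = 0.85  # assumed damping factor for the PageRank-style update


def run_job(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in memory: map, shuffle by key, reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)  # the "shuffle"
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def pagerank_map(node, value):
    rank, neighbors = value
    # Static link list is re-emitted (and re-shuffled) on every iteration.
    yield node, ("links", neighbors)
    for neighbor in neighbors:
        yield neighbor, ("rank", rank / len(neighbors))


def pagerank_reduce(node, values):
    neighbors, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            neighbors = payload
        else:
            total += payload
    new_rank = (1 - DAMPING) / len(GRAPH) + DAMPING * total
    return new_rank, neighbors


def drive(max_iters=200, tol=1e-6):
    """Client-side driver: submit one job per iteration, then test convergence."""
    state = {n: (1.0 / len(GRAPH), links) for n, links in GRAPH.items()}
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new_state = run_job(state.items(), pagerank_map, pagerank_reduce)
        delta = sum(abs(new_state[n][0] - state[n][0]) for n in state)
        state = new_state
        if delta < tol:  # convergence test performed at the client
            break
    return {n: rank for n, (rank, _) in state.items()}, iterations


if __name__ == "__main__":
    ranks, iterations = drive()
    print(iterations, ranks)
```

In this sketch the loop in `drive` plays the role of the user-written driver program. A framework with built-in iteration support, as the abstract proposes, would move that loop inside a single job and keep the link lists out of the per-iteration shuffle.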