MR-runner: a modularized map-reduce job management tool

Authors:
Xinsheng Yang;Wei Wang;Lijie Xu;Jie liu;Jun Wei
Affiliations:
Chinese Academy of Sciences, Beijing, P.R. China;Chinese Academy of Sciences, Beijing, P.R. China;Chinese Academy of Sciences, Beijing, P.R. China;Chinese Academy of Sciences, Beijing, P.R. China;Chinese Academy of Sciences, Beijing, P.R. China
Venue:
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Year:
2013

Citing 8
Cited 0

Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Spark: cluster computing with working sets

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Map-Reduce is a powerful solution for processing and analyzing large-scale data. Just as Hadoop and Spark are able to deal with terabyte data and even more. Users only need to complete "map" and "reduce" function, the Map-Reduce framework can finish variety jobs. But many machine learning and data mining algorithms cannot leverage the Map-Reduce framework or it would take large efforts to modify the algorithm itself. This issue can be explained by the following ways: 1. Map-Reduce is a batch operation so that most of Map-Reduce frameworks do not built-in to support iteration. 2. Map-Reduce is absolutely parallel, each vertex cannot obtain all records, so none of them could get the global optimal model. In this paper, we proposed a job management tool to enable the Map-Reduce framework to support iteration, called "de-parallel". This make the Map-Reduce framework like Hadoop so that Map-Reduce could run more algorithms and support more various tasks. In addition, our tool does not modify the Map-Reduce framework itself. In face MR-Runner interacts with Map-Reduce framework like a "client", therefore MR-Runner could be deployed in any single PC instead of Map-Reduce cluster. We also abstract the mainly interface related to Map-Reduce frameworks, this makes our tool portable to the representative Map-Reduce frameworks.