Assigning tasks for efficiency in Hadoop: extended abstract

  • Authors:
  • Michael J. Fischer; Xueyuan Su; Yitong Yin

  • Affiliations:
  • Yale University, New Haven, CT, USA; Yale University, New Haven, CT, USA; Nanjing University, Nanjing, China

  • Venue:
  • Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2010)
  • Year:
  • 2010

Abstract

In recent years, Google's MapReduce has emerged as a leading large-scale data-processing architecture. Adopted for daily use by companies such as Amazon, Facebook, Google, IBM, and Yahoo!, and more recently put to use by several universities, it allows parallel processing of huge volumes of data over a cluster of machines. Hadoop is a free Java implementation of MapReduce. In Hadoop, files are split into blocks, which are replicated and spread over all servers in a network. Each job is also split into many small pieces called tasks. Several tasks are processed on a single server, and a job is not completed until all of its assigned tasks are finished. A crucial factor affecting the completion time of a job is the particular assignment of tasks to servers. Given a placement of the input data over the servers, one wishes to find the assignment that minimizes the completion time. In this paper, an idealized Hadoop model is proposed to investigate the Hadoop task assignment problem. It is shown that there is no feasible algorithm to find the optimal Hadoop task assignment unless P = NP. Assignments computed by the round-robin algorithm inspired by the current Hadoop scheduler are shown to deviate from the optimum by a multiplicative factor in the worst case. A flow-based algorithm is presented that computes assignments that are optimal to within an additive constant.
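
To make the round-robin behavior concrete, the following is a minimal Java sketch (Java being Hadoop's implementation language) of a data-oblivious cyclic assignment of tasks to servers. The class RoundRobinAssigner and its method are hypothetical illustrations, not the paper's formal model or Hadoop's actual scheduler interface.

    import java.util.*;

    /**
     * Minimal sketch of a data-oblivious round-robin task assignment.
     * The class and method names are illustrative assumptions; this is
     * not Hadoop's scheduler API or the paper's formal model.
     */
    public class RoundRobinAssigner {

        // Assigns tasks to servers in cyclic order, ignoring which
        // server holds each task's data block.
        public static Map<String, List<Integer>> assign(
                List<Integer> taskIds, List<String> serverIds) {
            Map<String, List<Integer>> assignment = new HashMap<>();
            for (String s : serverIds) {
                assignment.put(s, new ArrayList<>());
            }
            for (int i = 0; i < taskIds.size(); i++) {
                // Cycle through the servers in a fixed order.
                String server = serverIds.get(i % serverIds.size());
                assignment.get(server).add(taskIds.get(i));
            }
            return assignment;
        }

        public static void main(String[] args) {
            List<Integer> tasks = Arrays.asList(0, 1, 2, 3, 4, 5, 6);
            List<String> servers = Arrays.asList("s1", "s2", "s3");
            System.out.println(assign(tasks, servers));
            // Prints e.g. {s1=[0, 3, 6], s2=[1, 4], s3=[2, 5]}
        }
    }

Because such a placement ignores data locality, a task may land on a server that holds no replica of its input block and therefore run more slowly; this is the intuition behind the worst-case multiplicative gap from the optimum that the paper establishes for the round-robin strategy.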