Exploring MapReduce efficiency with highly-distributed data

Authors:
Michael Cardosa; Chenyu Wang; Anshuman Nangia; Abhishek Chandra; Jon Weissman
Affiliation (all authors):
University of Minnesota, Minneapolis, MN, USA
Venue:
Proceedings of the second international workshop on MapReduce and its applications
Year:
2011

Abstract

MapReduce is a highly popular paradigm for high-performance computing over large data sets on large-scale platforms. However, when both the source data and the computing platform are widely distributed, e.g., when data is collected in separate data center locations, choosing the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show that the traditional single-cluster MapReduce setup may not be suitable when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce configurations, depending on the workload and data set.