A hierarchical framework for cross-domain MapReduce execution

Authors:
Yuan Luo;Zhenhua Guo;Yiming Sun;Beth Plale;Judy Qiu;Wilfred W. Li
Affiliations:
Indiana University, Bloomington, IN, USA;Indiana University, Bloomington, IN, USA;Indiana University, Bloomington, IN, USA;Indiana University, Bloomington, IN, USA;Indiana University, Bloomington, IN, USA;University of California, San Diego, La Jolla, CA, USA
Venue:
Proceedings of the second international workshop on Emerging computational methods for the life sciences
Year:
2011

Abstract

The MapReduce programming model provides an easy way to execute pleasantly parallel applications. Many data-intensive life science applications fit this programming model and benefit from the scalability it delivers. One such application is AutoDock, a suite of automated tools for predicting the bound conformations of flexible ligands to macromolecular targets. However, researchers also need sufficient computation and storage resources to fully benefit from MapReduce. For example, a typical AutoDock-based virtual screening experiment consists of a very large number of docking processes across multiple ligands and is often time-consuming to run on a single MapReduce cluster. Although commercial clouds can provide virtually unlimited computation and storage resources on demand, financial, security, and possibly other concerns lead many researchers to run experiments on a number of small clusters, each with a limited number of nodes, that cannot unleash the full power of MapReduce. In this paper, we present a hierarchical MapReduce framework that gathers computation resources from different clusters and runs MapReduce jobs across them. The global controller in our framework splits the data set, dispatches the partitions to multiple "local" MapReduce clusters, and balances the workload by assigning tasks in accordance with the capabilities of each cluster and of each node. The local results are then returned to the global controller for global reduction. Our experimental evaluation using AutoDock over MapReduce shows that our load-balancing algorithm achieves a promising workload distribution across multiple clusters, and thus minimizes the overall time span of the entire MapReduce execution.
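The split, dispatch, and global-reduce flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give its details, so the sketch assumes each cluster's capability is summarized by a single positive integer weight (for example, its total core count) and that each local result maps a ligand name to a docking energy where lower is better. The names `partition_by_capacity` and `global_reduce` are hypothetical.

```python
from typing import Dict, List, Sequence


def partition_by_capacity(items: Sequence[str],
                          capacity: Dict[str, int]) -> Dict[str, List[str]]:
    """Split items across clusters in proportion to integer capacity weights.

    Assumption: capacity[name] is a positive integer such as a core count.
    """
    if not capacity or any(w <= 0 for w in capacity.values()):
        raise ValueError("capacity weights must be positive integers")
    total = sum(capacity.values())
    n = len(items)
    # Floor of each cluster's proportional share, computed exactly in integers.
    counts = {name: n * w // total for name, w in capacity.items()}
    # Leftover items go to the clusters with the largest remainders.
    leftover = n - sum(counts.values())
    by_remainder = sorted(capacity,
                          key=lambda name: n * capacity[name] % total,
                          reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    # Hand out contiguous slices of the input in cluster order.
    partitions: Dict[str, List[str]] = {}
    start = 0
    for name in capacity:
        partitions[name] = list(items[start:start + counts[name]])
        start += counts[name]
    return partitions


def global_reduce(local_results: Sequence[Dict[str, float]]) -> List[tuple]:
    """Merge per-cluster results and rank ligands by energy (lowest first).

    Assumption: each local result maps ligand name -> docking energy.
    """
    merged: Dict[str, float] = {}
    for result in local_results:
        merged.update(result)
    return sorted(merged.items(), key=lambda pair: pair[1])


if __name__ == "__main__":
    ligands = [f"ligand{i}" for i in range(10)]
    plan = partition_by_capacity(ligands, {"clusterA": 4, "clusterB": 4, "clusterC": 2})
    for cluster, work in plan.items():
        print(cluster, len(work))
```

Using integer weights keeps the proportional shares exact, so the partition sizes always sum to the number of input items. A real deployment would also need to account for per-node differences and data transfer costs, which this sketch ignores.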