Predictable time-sharing for DryadLINQ cluster

  • Authors: Sang-Min Park; Marty Humphrey
  • Affiliations: University of Virginia, Charlottesville, VA, USA (both authors)
  • Venue: Proceedings of the 7th International Conference on Autonomic Computing
  • Year: 2010


Abstract

This paper addresses the scheduling problem that popular data-parallel programming systems such as DryadLINQ and MapReduce face today. Designing a cluster system for a multi-user environment is challenging because cluster schedulers must satisfy multiple, possibly conflicting, enterprise goals and policies. Particularly for these new types of data-intensive applications, it remains a challenge to simultaneously achieve both high throughput and predictable end-to-end performance for jobs (e.g., predictable start/end times). The conventional approach to scheduling such jobs is to determine the best mapping between tasks and nodes before the job executes; the scheduling system then ceases to be involved once the job starts running. Instead, as described in this paper, we define a reactive containment and control mechanism for scheduling and executing distributed tasks: we schedule the jobs and then continually monitor and adjust resources as each job executes. More specifically, a DryadLINQ task in our system is contained in a virtual machine, and distributed controllers regulate the task's progress at runtime. Using online, feedback-controlled VM CPU scheduling, our system gives a job the ability to speed up or slow down the progress of its concurrent sub-tasks so that the job makes predictable progress while sharing system resources with other jobs. This capability allows an enterprise to enforce flexible scheduling policies such as fair-share and/or job prioritization. Our evaluation results using five well-known DryadLINQ applications show that the implemented distributed controllers achieve high throughput as well as predictable end-to-end performance.
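The core idea of feedback-controlled VM CPU scheduling can be illustrated with a minimal sketch. The function names, the proportional-control law, and all parameters below are assumptions for illustration only; they are not taken from the paper, which does not specify its controller design at this level of detail.

```python
def cpu_cap_update(cap, progress, target_progress,
                   k_p=0.5, cap_min=5.0, cap_max=100.0):
    """One feedback-control step (hypothetical proportional controller).

    cap             -- current CPU cap for the task's VM, in percent
    progress        -- observed fraction of the sub-task completed (0.0-1.0)
    target_progress -- fraction the job's schedule says should be done by now

    A task running behind its target gets a larger CPU share; a task
    running ahead is throttled, freeing CPU for other jobs' VMs.
    """
    error = target_progress - progress          # positive -> task is behind
    new_cap = cap + k_p * error * 100.0         # scale fractional error to percent
    return max(cap_min, min(cap_max, new_cap))  # clamp to a valid CPU cap
```

In this sketch, a distributed controller would periodically observe each sub-task's progress, call a step like this, and apply the resulting cap through the hypervisor's CPU scheduler, so that concurrent jobs converge toward their policy-assigned shares rather than being fixed at job-launch time.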