On the feasibility of dynamic rescheduling on the Intel Distributed Computing Platform

Authors:
Zhuoyao Zhang;Linh T. X. Phan;Godfrey Tan;Saumya Jain;Harrison Duong;Boon Thau Loo;Insup Lee
Affiliations:
University of Pennsylvania;University of Pennsylvania;Intel Corporation;University of Pennsylvania;University of Pennsylvania;University of Pennsylvania;University of Pennsylvania
Venue:
Proceedings of the 11th International Middleware Conference Industrial track
Year:
2010

Abstract

This paper examines the feasibility of dynamic rescheduling techniques for effectively utilizing compute resources within a data center. Our work is motivated by practical concerns of the Intel Distributed Computing Platform (IDCP), an Internet-scale, data-center-based distributed computing platform developed by Intel Corporation for massively parallel chip simulations within the company. IDCP has been operational for many years and is currently deployed live on tens of thousands of machines distributed globally across various data centers. We analyze job execution traces collected over a one-year period from tens of thousands of IDCP machines in 20 different pools. Our analysis shows that IDCP currently does not make full use of its resources. Specifically, job completion time can be severely degraded by job suspension, which occurs when high-priority jobs preempt low-priority jobs. We then develop dynamic job rescheduling strategies that adaptively restart jobs on available resources elsewhere, which better utilizes system resources and improves completion times. Our trace-driven evaluation shows that dynamic rescheduling enables IDCP to significantly reduce system waste and the completion time of suspended jobs.
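
As a rough illustration of the restart decision the abstract describes, the Python sketch below compares two options for a preempted job: wait out the suspension, or restart from scratch on an idle machine elsewhere. It is not the paper's algorithm. The names `SuspendedJob`, `wait_completion`, `restart_completion`, and `should_restart`, the `restart_overhead` parameter, the assumption that a restart discards all completed work, and the assumption that the remaining suspension time can be estimated are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class SuspendedJob:
    """A low-priority job that was preempted (all fields are illustrative)."""
    total_work: float           # minutes of compute the job needs in total
    done_work: float            # minutes already completed before preemption
    expected_suspension: float  # estimated minutes until the job can resume


def wait_completion(job: SuspendedJob) -> float:
    """Minutes until completion if the job waits and then resumes in place."""
    return job.expected_suspension + (job.total_work - job.done_work)


def restart_completion(job: SuspendedJob, restart_overhead: float) -> float:
    """Minutes until completion if the job restarts from scratch elsewhere.

    Assumes (hypothetically) that a restart loses all completed work.
    """
    return restart_overhead + job.total_work


def should_restart(job: SuspendedJob,
                   idle_machine_available: bool,
                   restart_overhead: float = 1.0) -> bool:
    """Restart only if a machine is free and restarting finishes sooner."""
    if not idle_machine_available:
        return False
    return restart_completion(job, restart_overhead) < wait_completion(job)


# A job preempted early, facing a long suspension: restarting wins.
early = SuspendedJob(total_work=100, done_work=20, expected_suspension=120)
# A job preempted near the end, facing a short suspension: waiting wins.
late = SuspendedJob(total_work=100, done_work=90, expected_suspension=30)

print(should_restart(early, idle_machine_available=True))
print(should_restart(late, idle_machine_available=True))
```

Under these assumptions, restarting pays off mainly for jobs that are suspended early or for a long time, since a restart throws away `done_work`. That discarded work is one possible reading of the "system waste" trade-off mentioned in the abstract.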