SPHINX: A Fault-Tolerant System for Scheduling in Dynamic Grid Environments

Authors:
Jang-uk In;Paul Avery;Richard Cavanaugh;Laukik Chitnis;Mandar Kulkarni;Sanjay Ranka
Affiliations:
University of Florida;University of Florida;University of Florida;University of Florida;University of Florida;University of Florida
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Year:
2005

Abstract

A grid consists of high-end computational, storage, and network resources that, while known a priori, are dynamic with respect to activity and availability. Efficient scheduling of requests to use grid resources must adapt to this dynamic environment while meeting administrative policies. In this paper, we describe a framework called SPHINX that can administer grid policies and schedule complex, data-intensive scientific applications. We present experimental results for several scheduling strategies that effectively utilize the monitoring and job-tracking information provided by SPHINX. These results demonstrate that SPHINX can schedule work in a fault-tolerant way across a large number of distributed clusters owned by multiple units in a virtual organization, in spite of the highly dynamic nature of the grid and complex policy issues. The novelty lies in the use of effective resource monitoring and job-execution tracking to make scheduling decisions and provide fault tolerance, something that is missing in today's grid environments.