Supporting fault-tolerance for time-critical events in distributed environments

Authors:
Qian Zhu;Gagan Agrawal
Affiliations:
Ohio State University, Columbus, OH;Ohio State University, Columbus, OH
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Abstract

In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function while meeting a deadline. Our first contribution is a multi-objective optimization algorithm for scheduling the application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also maximizing the success rate, which is the probability of finishing execution without failures. For the cases where failures do occur, we have developed a hybrid failure-recovery scheme to ensure that the application can still complete within the pre-specified time interval. Our experimental results show that our scheduling algorithm achieves better benefit than several heuristic-based greedy scheduling algorithms, while incurring negligible overhead. Benefit is further improved when we apply the hybrid failure-recovery scheme, and the success rate becomes 100%.
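
The trade-off between benefit and success rate described above can be illustrated with a minimal sketch. It is not the scheduling algorithm from the paper: it only shows what optimizing two objectives at once means, by keeping the candidate schedules that no other candidate beats on both benefit and success rate. The `Candidate` type, the candidate names, the failure rates, and the exponential failure model in `success_rate` are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical schedule with its two objective values."""
    name: str
    benefit: float       # value of the user-specified benefit function
    success_rate: float  # probability of finishing without a failure


def success_rate(failure_rates, duration):
    """Probability that no node fails during `duration`.

    Assumes independent nodes with exponentially distributed failures.
    This is an illustrative model, not one taken from the paper.
    """
    return math.exp(-sum(failure_rates) * duration)


def dominates(a, b):
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.benefit >= b.benefit and a.success_rate >= b.success_rate
    better = a.benefit > b.benefit or a.success_rate > b.success_rate
    return no_worse and better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up candidates: per-node failure rates and a run length of 10 time units.
candidates = [
    Candidate("fast_unreliable", 0.9, success_rate([0.02, 0.03], 10.0)),
    Candidate("slow_reliable", 0.6, success_rate([0.001, 0.002], 10.0)),
    Candidate("dominated", 0.5, success_rate([0.02, 0.03], 10.0)),
]
front = pareto_front(candidates)
front_names = [c.name for c in front]
print(front_names)
```

In this toy setting, the third candidate is discarded because another candidate offers higher benefit at the same success rate. Choosing between the two survivors is the actual multi-objective decision, which this sketch leaves open.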