The Effect of Different Failure Recovery Procedures on the Distribution of Task Completion Times

Authors:
Robert Sheahan;Lester Lipsky;Pierre Fiorini
Affiliations:
University of Connecticut, Storrs, CT;University of Connecticut, Storrs, CT;University of Southern Maine, Portland, Maine
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
Year:
2005



Abstract

For a system to be reliable, it must have one or more methods of dealing with failures. Distributed systems face both node failure and communication channel failure. Communication channels, in particular, may suffer failures at a very high rate. Different systems respond to task failure in different ways. The system may resume a failed task from the failure point (or from a checkpoint saved shortly before it), restart the task from the beginning, or abandon the task and select a replacement from the ready queue. All three responses change the distribution of task completion times. This distribution is important because it governs mean service time and queue length, and therefore quality of service and the buffer size needed to manage the risk of overflow. The changes introduced by the failure response can even turn well-behaved, exponentially distributed times into power-tail distributed times with infinite mean and variance. In this paper we examine the characteristics of the distributions that result from restarting after each interrupt (Restart), with some discussion of Resume and Replace for comparison. We provide analytic and simulation solutions.
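
The three recovery procedures named in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' model or code. It assumes exponentially distributed task lengths (rate `task_rate`), failures arriving as a Poisson process (rate `fail_rate`), and zero repair time by default. All function and parameter names are hypothetical, chosen for illustration only.

```python
import random


def resume_time(task_len, fail_rate, rng, repair=0.0):
    """Resume: work done before a failure is kept; only repair time is lost."""
    elapsed, remaining = 0.0, task_len
    while True:
        to_fail = rng.expovariate(fail_rate)  # time until the next failure
        if to_fail >= remaining:
            return elapsed + remaining
        elapsed += to_fail + repair
        remaining -= to_fail


def restart_time(task_len, fail_rate, rng, repair=0.0):
    """Restart: the SAME task starts over from the beginning after a failure."""
    elapsed = 0.0
    while True:
        to_fail = rng.expovariate(fail_rate)
        if to_fail >= task_len:
            return elapsed + task_len
        elapsed += to_fail + repair  # all work on this attempt is lost


def replace_time(task_rate, fail_rate, rng, repair=0.0):
    """Replace: the failed task is abandoned and a fresh task is drawn."""
    elapsed = 0.0
    while True:
        task_len = rng.expovariate(task_rate)  # new task on every attempt
        to_fail = rng.expovariate(fail_rate)
        if to_fail >= task_len:
            return elapsed + task_len
        elapsed += to_fail + repair


def simulate(n=20000, task_rate=1.0, fail_rate=0.5, seed=1):
    """Draw n completion times under each policy (assumed parameters)."""
    rng = random.Random(seed)
    out = {"resume": [], "restart": [], "replace": []}
    for _ in range(n):
        task_len = rng.expovariate(task_rate)
        out["resume"].append(resume_time(task_len, fail_rate, rng))
        out["restart"].append(restart_time(task_len, fail_rate, rng))
        out["replace"].append(replace_time(task_rate, fail_rate, rng))
    return out


if __name__ == "__main__":
    results = simulate()
    for policy, times in results.items():
        mean = sum(times) / len(times)
        tail = sum(t > 10.0 for t in times)  # samples far out in the tail
        print(f"{policy:8s} mean={mean:.3f} max={max(times):.1f} n(t>10)={tail}")
```

In this sketch, Resume with zero repair time returns the task length itself, and Replace behaves like a fresh exponential draw, so both keep light tails. Restart is the case to watch. A short calculation under these same assumptions gives a conditional mean of (e^(fail_rate * L) - 1) / fail_rate for a task of length L, which averages to 1 / (task_rate - fail_rate) when fail_rate < task_rate and diverges otherwise. Run times also grow quickly as `fail_rate` approaches `task_rate`, so the default parameters are deliberately mild.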