On the completion time distribution for tasks that must restart from the beginning if a failure occurs

Authors:
Robert Sheahan;Lester Lipsky;Pierre M. Fiorini;Søren Asmussen
Affiliations:
University of Connecticut, Storrs, CT;University of Connecticut, Storrs, CT;University of Southern Maine, Portland, ME;Aarhus University, Denmark
Venue:
ACM SIGMETRICS Performance Evaluation Review
Year:
2006

Abstract

For many systems, failure is so common that the design choice of how to deal with it may have a significant impact on system performance. There are many specific and distinct failure recovery schemes, but they can be grouped into three broad classes: RESUME, also referred to as preemptive resume (prs) or checkpointing; REPLACE, also referred to as preemptive repeat different (prd); and RESTART, also referred to as preemptive repeat identical (pri). The three recovery schemes work as follows: (1) RESUME: when a task fails, it knows exactly where it stopped and can continue from that point when allowed to resume; (2) REPLACE: when a task fails and later begins processing again, it starts with a brand new task sampled from the same task time distribution; and (3) RESTART: when a task fails, it loses all that it had acquired up to that point and must start anew upon continuing later. RESTART is distinctly different from REPLACE, since the restarted task must run at least as long as it did before it failed, whereas a new sample, selected at random, might run for a shorter or longer time.
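
The difference between the three schemes can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes exponentially distributed times between failures, zero repair time, and a caller-supplied task time sampler, and the names `completion_time`, `mean_completion_time`, and `sample_task` are hypothetical, chosen only for illustration.

```python
import random

POLICIES = ("RESUME", "REPLACE", "RESTART")


def completion_time(policy, sample_task, failure_rate, rng):
    """Simulate one task until it completes and return the total time spent.

    Assumptions (not from the source): failures arrive after exponential
    uptimes with rate `failure_rate`, and repair takes no time.
    `sample_task(rng)` draws a task length from the task time distribution.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    original = sample_task(rng)   # length of the task as first drawn
    remaining = original          # work needed in the current attempt
    elapsed = 0.0
    while True:
        uptime = rng.expovariate(failure_rate)  # time until the next failure
        if uptime >= remaining:
            return elapsed + remaining          # finished before failing
        elapsed += uptime
        if policy == "RESUME":
            remaining -= uptime                 # continue where it stopped
        elif policy == "REPLACE":
            remaining = sample_task(rng)        # brand new sample
        else:  # RESTART
            remaining = original                # same task, from the start


def mean_completion_time(policy, sample_task, failure_rate, runs, seed=0):
    """Average completion time over `runs` independent simulated tasks."""
    rng = random.Random(seed)
    total = sum(completion_time(policy, sample_task, failure_rate, rng)
                for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    # Illustrative parameters only: exponential task times with mean 1,
    # failure rate 0.5.
    exp_task = lambda rng: rng.expovariate(1.0)
    for policy in POLICIES:
        print(policy, mean_completion_time(policy, exp_task, 0.5, runs=2000))
```

With no repair time, RESUME always finishes in exactly the task length, while a RESTART run can never finish sooner than the original task length, since every attempt must cover the same work again. Under REPLACE a long task may be swapped for a shorter one after a failure, which is the distinction the abstract draws between (2) and (3).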