On unreliable computing systems when heavy-tails appear as a result of the recovery procedure

Authors:
Pierre M. Fiorini;Robert Sheahan;Lester Lipsky
Affiliations:
University of Southern Maine, Portland, ME;University of Connecticut, Storrs, CT;University of Connecticut, Storrs, CT
Venue:
ACM SIGMETRICS Performance Evaluation Review - Special issue on the workshop on MAthematical performance Modeling And Analysis (MAMA 2005)
Year:
2005

Citing 1
Cited 7

The Importance of Power-Tail Distributions for Modeling Queueing Systems

Operations Research

On checkpointing and heavy-tails in unreliable computing environments

ACM SIGMETRICS Performance Evaluation Review
On the completion time distribution for tasks that must restart from the beginning if a failure occurs

ACM SIGMETRICS Performance Evaluation Review
Dynamic packet fragmentation for wireless channels with failures

Proceedings of the 9th ACM international symposium on Mobile ad hoc networking and computing
Is ALOHA causing power law delays?

ITC20'07 Proceedings of the 20th international teletraffic conference on Managing traffic performance in converged networks
Modulated Branching Processes, Origins of Power Laws, and Queueing Duality

Mathematics of Operations Research
Uniform approximation of the distribution for the number of retransmissions of bounded documents

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Retransmissions over correlated channels

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 31st international symposium on computer performance, modeling, measurements and evaluation (IFIPWG 7.3 Performance 2013)

Abstract

For some computing systems, failure is rare enough that it can be ignored. In other systems, failure is so common that the way it is handled can have a significant impact on system performance. There are many different recovery schemes for tasks; however, they can be classified into three broad categories: 1) Resume: when a task fails, it knows exactly where it stopped and can continue from that point when allowed to resume (i.e., preemptive resume - prs); 2) Replace: when a task fails, the processor begins a brand new task when it later continues (i.e., preemptive repeat different - prd); and 3) Restart: when a task fails, it loses all work done to that point and must start anew upon continuing later (i.e., preemptive repeat identical - pri).

In this paper, assuming a computing system is unreliable, we discuss how heavy-tail (hereafter referred to as power-tail - PT) distributions can appear in a job's task stream given the Restart recovery procedure. This is an important consideration since it is known that power-tails can lead to unstable systems [4]. We then demonstrate how to obtain performance and dependability measures for a class of computing systems composed of P unreliable processors and a finite number of tasks N given the above recovery procedures.
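To make the three recovery categories concrete, the following is a minimal simulation sketch, not the authors' model. It assumes, purely for illustration, that failures arrive as a Poisson process with rate `fail_rate`, that task lengths are exponentially distributed with rate `task_rate`, and that repair (down) time is zero. All function and parameter names are hypothetical. Under these assumptions the expected Restart completion time for a task of fixed length t is (e^(λt) - 1)/λ, where λ is the failure rate, so long tasks are penalized exponentially; this is the kind of mechanism that can produce a heavy tail.

```python
import random


def resume_time(task_len):
    """prs: no work is lost, so with zero repair time the total equals the task length."""
    return task_len


def restart_time(task_len, fail_rate, rng):
    """pri: every failure discards all work and the SAME task starts over."""
    total = 0.0
    while True:
        uptime = rng.expovariate(fail_rate)  # time until the next failure
        if uptime >= task_len:               # task finishes before the failure
            return total + task_len
        total += uptime                      # work done in this attempt is lost


def replace_time(task_len, task_rate, fail_rate, rng):
    """prd: every failure discards the task and a NEW task length is drawn."""
    total = 0.0
    while True:
        uptime = rng.expovariate(fail_rate)
        if uptime >= task_len:
            return total + task_len
        total += uptime
        task_len = rng.expovariate(task_rate)  # brand new task (assumed distribution)


def tail_fraction(samples, threshold):
    """Empirical estimate of P(X > threshold)."""
    return sum(1 for s in samples if s > threshold) / len(samples)


def simulate(n=20000, task_rate=1.0, fail_rate=0.5, seed=1):
    """Draw n tasks and record the completion time under each recovery scheme."""
    rng = random.Random(seed)
    out = {"resume": [], "replace": [], "restart": []}
    for _ in range(n):
        task_len = rng.expovariate(task_rate)
        out["resume"].append(resume_time(task_len))
        out["restart"].append(restart_time(task_len, fail_rate, rng))
        out["replace"].append(replace_time(task_len, task_rate, fail_rate, rng))
    return out


if __name__ == "__main__":
    results = simulate()
    for scheme, samples in results.items():
        fractions = [tail_fraction(samples, x) for x in (5.0, 10.0, 20.0)]
        print(scheme, fractions)
```

Comparing the printed tail fractions across schemes gives a rough picture of how much more slowly the Restart tail decays than the other two. The sketch ignores repair time and multiple processors, both of which the paper's setting of P processors and N tasks would require.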