Performing tasks on synchronous restartable message-passing processors

Authors:
Bogdan S. Chlebus;Roberto De Prisco;Alex A. Shvartsman
Affiliations:
Instytut Informatyki, Uniwersytet Warszawski, ul. Banacha 2, 02-097 Warszawa, Poland;Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, NE43-316 Cambridge, MA and Dipartimento di Informatica ed Applicazioni, University of Salerno, 84081 ...;Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, NE43-316 Cambridge, MA and Department of Computer Science and Engineering, University of Connecticut, ...
Venue:
Distributed Computing
Year:
2001


Abstract

This work considers the problem of performing $t$ tasks in a distributed system of $p$ fault-prone processors. This problem, called DO-ALL herein, was introduced by Dwork, Halpern and Waarts. The solutions presented here are for a model of computation that abstracts a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm in which multiple coordinators may be active as a result of failures. The first algorithm tolerates $f < p$ stop-failures and does not allow restarts. Its available processor steps (work) complexity is $S = O((t + p \log p / \log\log p) \cdot \log f)$ and its message complexity is $M = O(t + p \log p / \log\log p + fp)$. Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for $p = t$ and large $f$, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the DO-ALL problem that efficiently deals with processor restarts. Its available processor steps complexity is $S = O((t + p \log p + f) \cdot \min\{\log p, \log f\})$, and its message complexity is $M = O(t + p \log p + fp)$, where $f$ is the total number of failures.
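To get a feel for how the four bounds in the abstract scale, the expressions inside the O(...) can be evaluated directly. The following is a minimal sketch, not part of the paper: the function names are hypothetical, the constants hidden by the O-notation are dropped, logarithms are taken base 2, and every logarithm is clamped to at least 1 so that small inputs do not produce zero or negative factors. The values it returns indicate growth rates only and are not step or message counts.

```python
import math


def _lg(x):
    # Base-2 logarithm clamped to >= 1 (an assumption made for this sketch,
    # so that tiny p or f never yields a zero or negative factor).
    return math.log2(max(2.0, x))


def work_no_restarts(t, p, f):
    # Expression inside S = O((t + p log p / log log p) * log f).
    return (t + p * _lg(p) / _lg(_lg(p))) * _lg(f)


def messages_no_restarts(t, p, f):
    # Expression inside M = O(t + p log p / log log p + f p).
    return t + p * _lg(p) / _lg(_lg(p)) + f * p


def work_with_restarts(t, p, f):
    # Expression inside S = O((t + p log p + f) * min{log p, log f}).
    return (t + p * _lg(p) + f) * min(_lg(p), _lg(f))


def messages_with_restarts(t, p, f):
    # Expression inside M = O(t + p log p + f p).
    return t + p * _lg(p) + f * p


if __name__ == "__main__":
    # Illustrative parameters only: p = t with a large number of failures.
    t, p, f = 1024, 1024, 512
    print("no restarts:   work ~", round(work_no_restarts(t, p, f)),
          " messages ~", round(messages_no_restarts(t, p, f)))
    print("with restarts: work ~", round(work_with_restarts(t, p, f)),
          " messages ~", round(messages_with_restarts(t, p, f)))
```

In the restart model $f$ counts all failures over the execution, so it may exceed $p$; the `min` term then caps the multiplicative factor at $\log p$.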