A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation

Authors: Andreas Savva; Takashi Nanya
Affiliations: Fujitsu Ltd., Kawasaki, Japan; Univ. of Tokyo, Tokyo, Japan
Venue: IEEE Transactions on Computers
Year: 1999

Abstract

The Bulk-Synchronous Parallel (BSP) model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). Using the BSP model with RSM, we construct a gracefully degrading massively parallel system with a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and the graceful degradation of the system experimentally through an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
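
As a rough illustration of the kind of BSP-level analysis the abstract describes, the sketch below uses the standard BSP superstep cost, w + g*h + l, and assumes that a fixed total workload is spread evenly over the processors that are still alive. It is not the paper's own model: the names `BspParams`, `superstep_cost`, and `slowdown` are hypothetical, the even-split assumption is ours, and the overhead of the FT scheme (for example, memory duplication) is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BspParams:
    """Standard BSP machine parameters (the values used below are illustrative)."""
    g: float  # communication cost per word, in units of local operations
    l: float  # barrier synchronization cost per superstep


def superstep_cost(total_work: int, total_words: int, live: int,
                   params: BspParams) -> float:
    """Cost of one superstep, w + g*h + l, on `live` processors.

    Assumption (not from the paper): work and communication are divided
    evenly among the live processors, rounding up.
    """
    if live < 1:
        raise ValueError("at least one live processor is required")
    w = math.ceil(total_work / live)   # local computation per processor
    h = math.ceil(total_words / live)  # words sent or received per processor
    return w + params.g * h + params.l


def slowdown(total_work: int, total_words: int, processors: int,
             failed: int, params: BspParams) -> float:
    """Ratio of the superstep cost after `failed` processors are lost
    to the cost of the fault-free system."""
    before = superstep_cost(total_work, total_words, processors, params)
    after = superstep_cost(total_work, total_words, processors - failed, params)
    return after / before


params = BspParams(g=4.0, l=50.0)
print(superstep_cost(1000, 200, 10, params))  # fault-free cost
print(superstep_cost(1000, 200, 8, params))   # cost with two processors failed
print(slowdown(1000, 200, 10, 2, params))     # relative degradation
```

A degraded interconnection network could be explored in the same sketch by passing a larger `g` (or `l`) for the post-fault configuration; how those parameters actually change on a given machine is something the paper evaluates experimentally.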