Gracefully Degrading Systems Using the Bulk-Synchronous Parallel Model with Randomised Shared Memory
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
The Bulk-Synchronous Parallel (BSP) model was proposed as a unifying model for parallel computation. With Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). Using the BSP model with RSM, we construct a gracefully degrading massively parallel system. Its fault tolerance (FT) scheme relies on memory duplication to ensure global memory integrity and to speed up reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and the graceful degradation of the system experimentally with an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
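The kind of BSP-level analysis described above can be illustrated with a minimal sketch. It uses the standard BSP superstep cost, w + h*g + l, and assumes that the total computation and communication load is divided evenly among the live processors. That even split, the class and function names, and the optional degraded gap parameter are illustrative assumptions, not the paper's actual formulas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BspMachine:
    """Hypothetical BSP machine description (field names are illustrative)."""
    p: int      # number of processors
    g: float    # communication cost per word (gap)
    l: float    # barrier synchronization latency


def superstep_cost(w: float, h: float, m: BspMachine) -> float:
    """Standard BSP cost of one superstep: local work + h-relation + barrier."""
    return w + h * m.g + m.l


def degraded_cost(total_work: float, total_words: float, m: BspMachine,
                  failed: int, g_after: Optional[float] = None) -> float:
    """Superstep cost after `failed` processors are lost.

    Assumption (not taken from the paper): work and communication are
    spread evenly over the live processors, and the network gap may
    degrade to `g_after`.
    """
    live = m.p - failed
    if live <= 0:
        raise ValueError("no live processors remain")
    g = m.g if g_after is None else g_after
    degraded = BspMachine(p=live, g=g, l=m.l)
    w = total_work / live    # per-processor computation load
    h = total_words / live   # per-processor communication load
    return superstep_cost(w, h, degraded)


if __name__ == "__main__":
    machine = BspMachine(p=64, g=4.0, l=100.0)
    healthy = degraded_cost(6400, 640, machine, failed=0)
    half = degraded_cost(6400, 640, machine, failed=32)
    print(f"healthy: {healthy}, half failed: {half}, slowdown: {half / healthy:.2f}")
```

Under these assumptions, losing half the processors doubles the per-processor load but leaves the barrier term unchanged, so the slowdown is less than a factor of two whenever l is nonzero.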