A bridging model for parallel computation
Communications of the ACM
ACM Computing Surveys (CSUR)
Fail-stop processors: an approach to designing fault-tolerant computing systems
ACM Transactions on Computer Systems (TOCS)
Synthesis of Algorithm-Based Fault-Tolerant Systems from Dependence Graphs
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Software Engineering
Direct Bulk-Synchronous Parallel Algorithms
SWAT '92 Proceedings of the Third Scandinavian Workshop on Algorithm Theory
Simulation-based Comparison of Hash Functions for Emulated Shared Memory
PARLE '93 Proceedings of the 5th International PARLE Conference on Parallel Architectures and Languages Europe
Design of a Router for Fault-Tolerant Networks
PCRCW '94 Proceedings of the First International Workshop on Parallel Computer Routing and Communication
On the Practical Efficiency of Randomized Shared Memory
CONPAR '92/ VAPP V Proceedings of the Second Joint International Conference on Vector and Parallel Processing: Parallel Processing
Randomized Shared Memory - Concept and Efficiency of a Scalable Shared Memory Scheme
Parallel Computer Architectures: Theory, Hardware, Software, Applications
The NYU Ultracomputer Designing an MIMD Shared Memory Parallel Computer
IEEE Transactions on Computers
A comprehensive bibliography of distributed shared memory
ACM SIGOPS Operating Systems Review
A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation
IEEE Transactions on Computers
Abstract: The bulk-synchronous parallel model (BSPM) was proposed by Valiant (1990) as a bridging model for parallel computation. Combined with randomised shared memory (RSM), this model offers an asymptotically optimal emulation of the PRAM. Using the BSPM with RSM, we show how a gracefully degrading massively parallel system can be obtained through two mechanisms: memory duplication, which ensures global memory integrity and speeds up reconfiguration; and a global reconfiguration method, which restores the logical properties of the system after a fault occurs. We assume fail-stop processors, single faults, no spare processors, and no significant loss of network throughput as a result of faults. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. The overhead of the scheme and the graceful degradation achieved depend on the program being executed. We evaluate the reconfiguration, overhead, and graceful degradation of the system experimentally.
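
The interplay of hashed memory placement, duplication, and reconfiguration described in the abstract can be illustrated with a toy model. The sketch below is not the paper's scheme: the class name ToyRSM, the linear hash, the rule for choosing the second copy, and the sequential redistribution loop are all assumptions made for illustration. It shows only the general idea that, with two copies of every word on distinct processors, a single fail-stop fault loses no data, and the surviving words can be rehashed over the remaining live processors.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; modulus for the illustrative hash


class ToyRSM:
    """Toy randomised shared memory with two copies per address (assumed design)."""

    def __init__(self, num_procs, seed=0):
        if num_procs < 3:
            raise ValueError("need at least 3 processors to survive one fault")
        self.rng = random.Random(seed)
        self.live = list(range(num_procs))
        self.local = {p: {} for p in self.live}  # per-processor local memory
        self._pick_hash()

    def _pick_hash(self):
        # Hypothetical hash family h(x) = (a*x + b) mod PRIME.
        self.a = self.rng.randrange(1, PRIME)
        self.b = self.rng.randrange(PRIME)

    def homes(self, addr):
        """Return two distinct live processors holding the copies of addr."""
        n = len(self.live)
        h = (self.a * addr + self.b) % PRIME
        i = h % n
        # Offset lies in 1..n-1, so the second copy never lands on the first.
        j = (i + 1 + (h // n) % (n - 1)) % n
        return self.live[i], self.live[j]

    def write(self, addr, value):
        for p in self.homes(addr):
            self.local[p][addr] = value

    def read(self, addr):
        primary, _ = self.homes(addr)
        return self.local[primary][addr]

    def fail(self, proc):
        """Single fail-stop fault: drop proc, then rehash over the survivors."""
        self.live.remove(proc)
        del self.local[proc]
        # Every address still has at least one copy on a live processor.
        survivors = {}
        for mem in self.local.values():
            survivors.update(mem)
        # Sequential here; the paper shares this work among live processors.
        self.local = {p: {} for p in self.live}
        self._pick_hash()
        for addr, value in survivors.items():
            self.write(addr, value)
```

For example, after creating ToyRSM(8), writing some words, and calling fail(3), every word is still readable and again has two copies on distinct live processors. The capacity available per processor shrinks as the system degrades, which this toy does not model.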