The Stanford Dash Multiprocessor
Computer
Operating system support for high-performance multiprocessing
Operating system support for high-performance multiprocessing
A checkpoint protocol for an entry consistent shared memory system
PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Scalable fault-tolerant distributed shared memory
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tolerating node failures in cache only memory architectures
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Cache-Only Memory Architectures
Computer
The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Distributed shared memory: where we are and where we should be headed
HOTOS '95 Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V)
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
THE MIT ALEWIFE MACHINE: A LARGE-SCALE DISTRIBUTED-MEMORY MULTIPROCESSOR
THE MIT ALEWIFE MACHINE: A LARGE-SCALE DISTRIBUTED-MEMORY MULTIPROCESSOR
SPLASH: Stanford parallel applications for shared-memory
SPLASH: Stanford parallel applications for shared-memory
Using Peer Support to Reduce Fault-Tolerant Overhead in Distributed Shared Memories
Using Peer Support to Reduce Fault-Tolerant Overhead in Distributed Shared Memories
Hi-index | 0.00 |
Designing multiprocessors based on distributed shared memory (DSM) architecture considerably increases their scalability. But as the number of nodes in a multiprocessor increases, the probability of encountering failures in one or more nodes of the system raises as a serious problem. Thus, every large-scale multiprocessor should be equipped with mechanisms that tolerate node failures. Backward error recovery (BER) is one of the most feasible strategies to build fault tolerant multiprocessors and it can be shown that among various DSM-based architectures, cache only memory architecture (COMA) is the most suitable for implementing BER. The main reason is the existence of built-in mechanisms for data replication in COMA memory system. BER is applicable to COMA multiprocessors with minor hardware redundancy, but it will obviously cause some other kinds of overheads. The most important overhead induced by BER is the time required to produce and store recovery data. This paper introduces an analytical model for predicting the amount of this time overhead and then verifies the correctness of the model through comparing the results predicted from this model with the previously published simulation results. Both the analytical model and simulation results show that the overhead is nearly independent of the number of nodes. The immediate result is that BER is a cost-effective strategy for tolerating node failures in large-scale COMA multiprocessors with large numbers of nodes.