Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
High-Performance Fault-Tolerant VLSI Systems Using Micro Rollback
IEEE Transactions on Computers
Parallel algorithms for test generation and fault simulation
Parallel algorithms for test generation and fault simulation
Address tracing of parallel systems via TRAPEDS
Microprocessors & Microsystems
Virtual Checkpoints: Architecture and Performance
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Compiler-Based Multiple Instruction Retry
IEEE Transactions on Computers
Implementing a cache consistency protocol
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Error Recovery in Shared Memory Multiprocessors Using Private Caches
IEEE Transactions on Parallel and Distributed Systems
A low-overhead coherence solution for multiprocessors with private cache memories
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Ultra low-cost defect protection for microprocessor pipelines
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Rebound: scalable checkpointing for coherent shared memory
Proceedings of the 38th annual international symposium on Computer architecture
Computer Communications
Hi-index | 0.00 |
Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval.