The problem of recovering from transient processor faults in shared-memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme that uses private caches is presented. Processes recover from errors caused by faulty processors by restarting from the checkpointed computation state. Implementation techniques based on checkpoint identifiers and recovery stacks are examined as a means of reducing the loss of processor utilization during normal execution. The cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are also presented.
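The general idea of cache-based checkpointing can be illustrated with a toy model. The sketch below is not the scheme evaluated in the paper: it assumes a single write-back private cache, treats main memory as the checkpointed state, and omits the coherence protocol and the recovery stack. The class name `CheckpointingCache` and all of its methods are hypothetical, introduced only for illustration, and the counter here only loosely echoes the checkpoint identifiers mentioned above.

```python
class CheckpointingCache:
    """Toy private write-back cache; main memory holds the checkpointed state.

    Hypothetical illustration only: names and structure are assumptions,
    not the design described in the abstract.
    """

    def __init__(self, memory):
        self.memory = memory            # shared dict: address -> checkpointed value
        self.dirty = {}                 # lines modified since the last checkpoint
        self.registers = {}             # live processor registers
        self._saved_registers = {}      # register snapshot taken at the last checkpoint
        self.checkpoint_id = 0          # counts checkpoints taken so far

    def read(self, addr):
        # Prefer the uncommitted cached value; fall back to checkpointed memory.
        if addr in self.dirty:
            return self.dirty[addr]
        return self.memory.get(addr, 0)

    def write(self, addr, value):
        # Writes stay in the private cache until the next checkpoint.
        self.dirty[addr] = value

    def checkpoint(self):
        # Commit dirty lines to memory and snapshot the registers.
        self.memory.update(self.dirty)
        self.dirty.clear()
        self._saved_registers = dict(self.registers)
        self.checkpoint_id += 1

    def rollback(self):
        # Discard uncommitted lines and restore the register snapshot.
        self.dirty.clear()
        self.registers = dict(self._saved_registers)


# Usage: a write made after the checkpoint disappears on rollback.
memory = {0x10: 1}
cache = CheckpointingCache(memory)
cache.registers["pc"] = 100
cache.checkpoint()
cache.write(0x10, 99)
cache.registers["pc"] = 140
cache.rollback()
```

Because uncommitted writes never leave the private cache in this model, other processors cannot observe them, which is one way to see why a rollback need not propagate. A real design must also decide when a checkpoint is forced, for example when a dirty line has to be evicted; that policy is left out here.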