Error Recovery in Shared Memory Multiprocessors Using Private Caches

Authors:
K. L. Wu; W. K. Fuchs; J. H. Patel
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1990

Abstract

This paper examines the problem of recovering from transient processor faults in shared-memory multiprocessor systems. It presents a user-transparent checkpointing and recovery scheme that uses private caches. Processes recover from errors caused by faulty processors by restarting from the checkpointed computation state. Implementation techniques based on checkpoint identifiers and recovery stacks are examined as a means of reducing the loss of processor utilization during normal execution. The cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are also presented.
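
The general idea of cache-based checkpointing can be illustrated with a toy model. The sketch below is not the paper's design. It assumes a single write-back private cache in which dirty lines hold the state modified since the last checkpoint, while shared memory holds the checkpointed state. The class name `CheckpointingCache` and all of its methods are hypothetical. Register state, coherence traffic between processors, checkpoint identifiers, and recovery stacks are deliberately left out.

```python
class CheckpointingCache:
    """Toy private write-back cache (hypothetical, for illustration only).

    Assumption: dirty lines are the uncommitted state since the last
    checkpoint, and shared memory always holds the checkpointed state.
    """

    def __init__(self, memory, capacity=2):
        self.memory = memory      # dict modelling shared memory (checkpoint state)
        self.capacity = capacity  # maximum number of cached lines
        self.lines = {}           # addr -> (value, dirty flag)
        self.checkpoints = 0      # number of checkpoints taken so far

    def read(self, addr):
        if addr not in self.lines:
            self._make_room()
            self.lines[addr] = (self.memory.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        if addr not in self.lines:
            self._make_room()
        # The write stays in the cache; memory is left untouched.
        self.lines[addr] = (value, True)

    def _make_room(self):
        if len(self.lines) < self.capacity:
            return
        # Prefer to evict a clean line, which needs no write-back.
        for addr, (_, dirty) in list(self.lines.items()):
            if not dirty:
                del self.lines[addr]
                return
        # Every line is dirty: replacing one would leak uncommitted
        # state into memory, so a checkpoint is taken first.
        self.checkpoint()
        victim = next(iter(self.lines))
        del self.lines[victim]

    def checkpoint(self):
        # Commit all dirty lines to memory and mark them clean.
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory[addr] = value
                self.lines[addr] = (value, False)
        self.checkpoints += 1

    def rollback(self):
        # Discard uncommitted state; later reads refetch from memory.
        self.lines = {a: l for a, l in self.lines.items() if not l[1]}
```

In this model, recovery is quick because rollback only discards dirty lines, and no other cache ever observes uncommitted data because it never leaves the private cache. A multiprocessor version would also need to take a checkpoint before supplying a dirty line to another processor. That is one plausible reading of how rollback propagation is prevented, and it is an assumption here rather than something stated in the abstract.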