SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery

Authors:
Daniel J. Sorin;Milo M. K. Martin;Mark D. Hill;David A. Wood
Affiliations:
University of Wisconsin---Madison;University of Wisconsin---Madison;University of Wisconsin---Madison;University of Wisconsin---Madison
Venue:
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Year:
2002

Citing 31
Cited 76

The STRATUS computer system

Resilient computing systems: vol. 1
Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing

Computer
A case for redundant arrays of inexpensive disks (RAID)

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
The fuzzy barrier: a mechanism for high speed synchronization of processors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
IBM experiments in soft fails in computer electronics (1978–1994)

IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Fault-tolerant computer system design

Fault-tolerant computer system design
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Generating representative Web workloads for network and server performance evaluation

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Is SC + ILP = RC?

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
DIVA: a reliable substrate for deep submicron microarchitecture design

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading

Proceedings of the 27th annual international symposium on Computer architecture
Architectural support for scalable speculative parallelization in shared-memory multiprocessors

Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Interconnection Networks: An Engineering Approach

Interconnection Networks: An Engineering Approach
Increasing Work, Pushing the Clock

Computer
Reining in Complexity

Computer
Simics: A Full System Simulation Platform

Computer
Verifying a Multiprocessor Cache Controller Using Random Test Generation

IEEE Design & Test
Spider: A High-Speed Network Interconnect

IEEE Micro
Starfire: Extending the SMP Envelope

IEEE Micro
Error Recovery in Shared Memory Multiprocessors Using Private Caches

IEEE Transactions on Parallel and Distributed Systems
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Speculative Versioning Cache

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Checkpointing and Its Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor

Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor
IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective

IBM Journal of Research and Development

Detailed design and evaluation of redundant multithreading alternatives

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Variability in Architectural Simulations of Multi-Threaded Workloads

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes

Proceedings of the 30th annual international symposium on Computer architecture
A "flight data recorder" for enabling full-system multiprocessor deterministic replay

Proceedings of the 30th annual international symposium on Computer architecture
Adaptive incremental checkpointing for massively parallel systems

Proceedings of the 18th annual international conference on Supercomputing
Fingerprinting: bounding soft-error detection latency and bandwidth

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Application-level checkpointing for shared memory programs

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Fingerprinting: Bounding Soft-Error-Detection Latency and Bandwidth

IEEE Micro
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
A serializability violation detector for shared-memory server programs

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging

Proceedings of the 32nd annual international symposium on Computer Architecture
Dynamic Verification of Sequential Consistency

Proceedings of the 32nd annual international symposium on Computer Architecture
Energy reduction in multiprocessor systems using transactional memory

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Memory State Compressors for Giga-Scale Checkpoint/Restore

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Truss: A Reliable, Scalable Server Architecture

IEEE Micro
BugNet: Recording Application-Level Execution for Deterministic Replay Debugging

IEEE Micro
Fast and transparent recovery for continuous availability of cluster-based servers

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors

Proceedings of the 33rd annual international symposium on Computer Architecture
Recording shared memory dependencies using strata

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
SWICH: A Prototype for Efficient Cache-Level Checkpointing and Rollback

IEEE Micro
Experimental evaluation of application-level checkpointing for OpenMP programs

Proceedings of the 20th annual international conference on Supercomputing
Architecture of a Self-Checkpointing Microprocessor that Incorporates Nanomagnetic Devices

IEEE Transactions on Computers
Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
A hardware/software framework for supporting transactional memory in a MPSoC environment

ACM SIGARCH Computer Architecture News
Configurable isolation: building high availability systems with commodity multi-core processors

Proceedings of the 34th annual international symposium on Computer architecture
Patching Processor Design Errors with Programmable Hardware

IEEE Micro
Adapting to intermittent faults in multicore systems

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Understanding the propagation of hard errors to software and implications for resilient system design

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Atom-Aid: Detecting and Surviving Atomicity Violations

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Techniques for Efficient Software Checking

Languages and Compilers for Parallel Computing
Anomaly-based bug prediction, isolation, and validation: an automated approach for software debugging

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Online design bug detection: RTL analysis, flexible mechanisms, and evaluation

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
RT-replayer: a record-replay architecture for embedded real-time software debugging

Proceedings of the 2009 ACM symposium on Applied Computing
Dynamic heterogeneity and the need for multicore virtualization

ACM SIGOPS Operating Systems Review
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
A case for an interleaving constrained shared-memory multi-processor

Proceedings of the 36th annual international symposium on Computer architecture
Architecture Design for Soft Errors

Architecture Design for Soft Errors
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Offline symbolic analysis for multi-processor execution replay

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Specifying and dynamically verifying address translation-aware memory consistency

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Adaptive online testing for efficient hard fault detection

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Timetraveler: exploiting acyclic races for optimizing memory race recording

Proceedings of the 37th annual international symposium on Computer architecture
Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors

Proceedings of the 37th annual international symposium on Computer architecture
Relax: an architectural framework for software recovery of hardware faults

Proceedings of the 37th annual international symposium on Computer architecture
Fault-Tolerant Flow Control in On-chip Networks

NOCS '10 Proceedings of the 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip
Using introspective software-based testing for post-silicon debug and repair

Proceedings of the 47th Design Automation Conference
Design techniques for cross-layer resilience

Proceedings of the Conference on Design, Automation and Test in Europe
A self-adaptive system architecture to address transistor aging

Proceedings of the Conference on Design, Automation and Test in Europe
CASPAR: hardware patching for multi-core processors

Proceedings of the Conference on Design, Automation and Test in Europe
Checkpoint/restart-enabled parallel debugging

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Data race avoidance and replay scheme for developing and debugging parallel programs on distributed shared memory systems

Parallel Computing
Rebound: scalable checkpointing for coherent shared memory

Proceedings of the 38th annual international symposium on Computer architecture
Sampling + DMR: practical and low-overhead permanent fault detection

Proceedings of the 38th annual international symposium on Computer architecture
DRAIN: distributed recovery architecture for inaccessible nodes in multi-core chips

Proceedings of the 48th Design Automation Conference
Reliability analysis for MPSoCs with mixed-critical, hard real-time constraints

CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
A systematic methodology to develop resilient cache coherence protocols

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Encore: low-cost, fine-grained transient fault recovery

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Application-Level checkpointing techniques for parallel programs

ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Using dynamic task level redundancy for OpenMP fault tolerance

ARCS'12 Proceedings of the 25th international conference on Architecture of Computing Systems
Specification and synthesis of hardware checkpointing and rollback mechanisms

Proceedings of the 49th Annual Design Automation Conference
UniFI: leveraging non-volatile memories for a unified fault tolerance and idle power management technique

Proceedings of the 26th ACM international conference on Supercomputing
Euripus: a flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability

Proceedings of the 39th Annual International Symposium on Computer Architecture
Viper: virtual pipelines for enhanced reliability

Proceedings of the 39th Annual International Symposium on Computer Architecture
Runtime verification: a computer architecture perspective

RV'11 Proceedings of the Second international conference on Runtime verification
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory

Proceedings of the ACM International Conference on Computing Frontiers
FaulTM: error detection and recovery using hardware transactional memory

Proceedings of the Conference on Design, Automation and Test in Europe
Reli: hardware/software checkpoint and recovery scheme for embedded processors

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
A survey of checker architectures

ACM Computing Surveys (CSUR)
Semi-automated debugging via binary search through a process lifetime

Proceedings of the Seventh Workshop on Programming Languages and Operating Systems
Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
A low-power instruction replay mechanism for design of resilient microprocessors

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution.We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common-case of fault-free execution, and (b) avoids a crash when tolerated faults occur.