Rebound: scalable checkpointing for coherent shared memory

Authors:
Rishi Agarwal;Pranav Garg;Josep Torrellas
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
Proceedings of the 38th annual international symposium on Computer architecture
Year:
2011

Citing 21
Cited 5

Checkpointing and Rollback-Recovery for Distributed Systems

IEEE Transactions on Software Engineering - Special issue on distributed systems
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit

IEEE Transactions on Computers - Special issue on fault-tolerant computing
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures

IEEE Transactions on Computers
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Scalable fault-tolerant distributed shared memory

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Fault Tolerance: Principles and Practice

Fault Tolerance: Principles and Practice
The Performance of Cache-Based Error Recovery in Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Fault Recovery Mechanism for Multiprocessor Servers

FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Multiprocessor Architecture Using an Audit Trail for Fault Tolerance

FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
DRAMsim: a memory system simulator

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Bulk Disambiguation of Speculative Threads in Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Phase-change random access memory: a scalable technology

IBM Journal of Research and Development
Architecture Design for Soft Errors

Architecture Design for Soft Errors

UniFI: leveraging non-volatile memories for a unified fault tolerance and idle power management technique

Proceedings of the 26th ACM international conference on Supercomputing
Euripus: a flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability

Proceedings of the 39th Annual International Symposium on Computer Architecture
Fault prediction under the microscope: a closer look into HPC systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A 1 PB/s file system to checkpoint three million MPI tasks

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory

Proceedings of the ACM International Conference on Computing Frontiers

Quantified Score

Hi-index	0.00

Visualization

Abstract

As we move to large manycores, the hardware-based global check-pointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.