Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Wattch: a framework for architectural-level power analysis and optimizations
Proceedings of the 27th annual international symposium on Computer architecture
An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures
IEEE Transactions on Computers
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
Scalable fault-tolerant distributed shared memory
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Fault Tolerance: Principles and Practice
Fault Tolerance: Principles and Practice
The Performance of Cache-Based Error Recovery in Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Fault Recovery Mechanism for Multiprocessor Servers
FTCS '97 Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Multiprocessor Architecture Using an Audit Trail for Fault Tolerance
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
DRAMsim: a memory system simulator
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Bulk Disambiguation of Speculative Threads in Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Phase-change random access memory: a scalable technology
IBM Journal of Research and Development
Architecture Design for Soft Errors
Architecture Design for Soft Errors
Proceedings of the 26th ACM international conference on Supercomputing
Proceedings of the 39th Annual International Symposium on Computer Architecture
Fault prediction under the microscope: a closer look into HPC systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A 1 PB/s file system to checkpoint three million MPI tasks
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
Proceedings of the ACM International Conference on Computing Frontiers
Hi-index | 0.00 |
As we move to large manycores, the hardware-based global check-pointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.