Low-Latency, Concurrent Checkpointing for Parallel Programs

Authors:
K. Li; J. F. Naughton; J. S. Plank
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1994

Abstract

Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
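
The copy-on-write idea mentioned in the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation: it assumes memory is a plain Python list of "pages", simulates write faults with an ordinary method call, and runs the copier as explicit steps rather than as a concurrent thread. The class name CowCheckpointer and all of its methods are hypothetical names chosen for illustration.

```python
class CowCheckpointer:
    """Toy model of copy-on-write checkpointing over a list of 'pages'.

    Assumption: real systems protect pages through the virtual memory
    hardware; here a set of indices stands in for page protection.
    """

    def __init__(self, pages):
        self.pages = list(pages)   # live memory of the target program
        self.protected = set()     # pages not yet saved in this checkpoint
        self.snapshot = {}         # page index -> contents at checkpoint time

    def begin_checkpoint(self):
        # Short interruption: pages are only marked, no data is copied yet.
        self.snapshot = {}
        self.protected = set(range(len(self.pages)))

    def write(self, index, value):
        # Simulated write fault: save the old contents before the first
        # write to a page that the copier has not reached yet.
        if index in self.protected:
            self._save(index)
        self.pages[index] = value

    def copier_step(self):
        # Background copier: save one still-protected page, if any remain.
        if not self.protected:
            return False
        self._save(min(self.protected))
        return True

    def _save(self, index):
        self.snapshot[index] = self.pages[index]
        self.protected.discard(index)

    def finished_snapshot(self):
        # Only meaningful once every page has been saved.
        if self.protected:
            raise RuntimeError("checkpoint still in progress")
        return [self.snapshot[i] for i in range(len(self.pages))]
```

In this model the target program is stopped only for begin_checkpoint, which copies nothing. Calls to write and copier_step can then interleave in any order, and the finished snapshot still reflects memory as it was when the checkpoint began, which is the sense in which the expensive copying overlaps with program execution.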