We have developed and implemented a checkpointing and restart algorithm for parallel programs running on commercial uniprocessors and shared-memory multiprocessors. The algorithm runs concurrently with the target program, interrupts it only for small, fixed amounts of time, and is transparent to both the checkpointed program and its compiler. The algorithm achieves its efficiency through a novel use of address translation hardware, which allows the most time-consuming operations of the checkpoint to be overlapped with the execution of the program being checkpointed.
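The abstract does not spell out the mechanism, so the following is only an illustrative sketch of one well-known way to overlap a checkpoint with a running program: copy-on-write over write-protected pages. It is a pure-software toy model, not the authors' implementation. The class `CowCheckpointer`, its methods, and `PAGE_SIZE` are hypothetical names, and the `protected` set stands in for the page protection bits that real address translation hardware would provide.

```python
PAGE_SIZE = 4  # assumed toy page size, far smaller than a real page


class CowCheckpointer:
    """Toy model of copy-on-write checkpointing over a paged memory.

    Hypothetical illustration only: a real system would set page
    protections in the MMU and catch write faults; here a set of
    "protected" page numbers plays that role.
    """

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.protected = set()  # pages not yet saved in this checkpoint
        self.snapshot = {}      # page number -> contents at checkpoint start

    def begin_checkpoint(self):
        # The only step that must pause the program: protect every page.
        # Its cost does not depend on how much data the pages hold.
        self.snapshot = {}
        self.protected = set(range(len(self.pages)))

    def write(self, page, offset, value):
        # Models a write fault: save the old contents before the write lands.
        if page in self.protected:
            self._save(page)
        self.pages[page][offset] = value

    def copy_some(self, max_pages):
        # Models the background copier that runs alongside the program.
        for page in sorted(self.protected)[:max_pages]:
            self._save(page)
        return not self.protected  # True once every page has been saved

    def _save(self, page):
        self.snapshot[page] = bytes(self.pages[page])
        self.protected.discard(page)

    def restore(self):
        # Restart: roll every page back to its checkpointed contents.
        for page, data in self.snapshot.items():
            self.pages[page][:] = data


mem = CowCheckpointer(num_pages=3)
mem.write(0, 0, 1)         # state before the checkpoint
mem.begin_checkpoint()     # short pause: protect all pages
mem.write(0, 0, 9)         # program keeps running; old page 0 is saved first
mem.copy_some(1)           # background copier saves page 1
mem.write(2, 1, 7)         # another "fault", this time on page 2
done = mem.copy_some(10)   # nothing left to copy
mem.restore()              # memory is back to its state at begin_checkpoint
```

In this model the program is stopped only for `begin_checkpoint`, while the bulk copying happens either lazily on the first write to a page or in the background copier, which is the kind of overlap the abstract describes.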