We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread and specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries. It is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 µs. It supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
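To make the distinction between full and incremental checkpoints concrete, the following is a minimal user-level sketch in Python. It is not TICK's implementation: the abstract does not say how TICK identifies modified pages inside the kernel, so this sketch substitutes a simple hash comparison per page. The page size and all function names (`split_pages`, `full_checkpoint`, `incremental_checkpoint`, `restore`) are assumptions made for illustration, and the sketch assumes the memory image never shrinks between checkpoints.

```python
from __future__ import annotations

import hashlib

PAGE_SIZE = 4096  # assumed page size, chosen only for illustration


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Cut a flat memory image into fixed-size pages."""
    return [bytes(memory[i:i + page_size]) for i in range(0, len(memory), page_size)]


def page_digests(memory: bytes) -> list[bytes]:
    """One SHA-256 digest per page, used here to detect modified pages."""
    return [hashlib.sha256(page).digest() for page in split_pages(memory)]


def full_checkpoint(memory: bytes) -> dict[int, bytes]:
    """A full checkpoint saves every page, keyed by page index."""
    return dict(enumerate(split_pages(memory)))


def incremental_checkpoint(
    memory: bytes, previous_digests: list[bytes]
) -> tuple[dict[int, bytes], list[bytes]]:
    """Save only the pages whose digest changed since the last checkpoint.

    Returns the saved pages and the digests to compare against next time.
    """
    pages = split_pages(memory)
    digests = [hashlib.sha256(page).digest() for page in pages]
    delta = {
        index: page
        for index, (page, digest) in enumerate(zip(pages, digests))
        if index >= len(previous_digests) or digest != previous_digests[index]
    }
    return delta, digests


def restore(full: dict[int, bytes], deltas: list[dict[int, bytes]]) -> bytes:
    """Rebuild the image: start from the full checkpoint, apply deltas in order."""
    pages = dict(full)
    for delta in deltas:
        pages.update(delta)
    return b"".join(pages[index] for index in sorted(pages))


# Usage: modify one page out of four and checkpoint incrementally.
memory = bytearray(4 * PAGE_SIZE)
base = full_checkpoint(bytes(memory))
digests = page_digests(bytes(memory))
memory[PAGE_SIZE + 10] = 0xFF  # touches page 1 only
delta, digests = incremental_checkpoint(bytes(memory), digests)
```

In this example the incremental checkpoint holds a single page rather than all four, which is the source of the space and bandwidth savings that incremental checkpointing aims for. Restoring requires the last full checkpoint plus every later delta, applied in order.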