The development of an efficient checkpointing facility exploiting operating systems services of the GENESIS cluster operating system

Authors:
J. T. Rough; A. M. Goscinski
Affiliations:
School of Information Technology, Deakin University, Geelong, Vic. 3217, Australia; School of Information Technology, Deakin University, Geelong, Vic. 3217, Australia
Venue:
Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Year:
2004

Abstract

Recent research on parallel processing on non-dedicated clusters has focused on high execution performance, parallelism management, transparent access to resources, and ease of use. However, because a cluster is a collection of independent computers shared by multiple users, it is susceptible to failure. This paper presents the development of a coordinated checkpointing facility for the GENESIS cluster operating system, built by exploiting existing operating system services. The facility achieves high performance and low overhead by allowing the processes of a parallel application to continue executing while checkpoints are created, and it keeps demands on cluster resources low by using coordinated checkpointing.
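
The general idea of coordinated checkpointing in which processes keep running while checkpoints are created can be illustrated with a minimal sketch. This is not the GENESIS implementation: the names `Worker`, `coordinated_checkpoint`, and `stable_storage` are hypothetical, threads stand in for cluster processes, a dictionary stands in for stable storage, and the handling of in-transit messages that a real facility needs for a consistent global state is omitted.

```python
import copy
import threading


class Worker:
    """Hypothetical application process holding in-memory state."""

    def __init__(self, name):
        self.name = name
        self.state = {"step": 0}
        self.lock = threading.Lock()

    def compute(self):
        # Normal application work: advance the state by one step.
        with self.lock:
            self.state["step"] += 1

    def snapshot(self):
        # Hold the lock only long enough to copy the state; the copy is
        # what gets saved, so the worker can resume computing right away.
        with self.lock:
            return copy.deepcopy(self.state)


def coordinated_checkpoint(workers, stable_storage, version):
    """Two-phase sketch: gather tentative snapshots, then commit them together."""
    tentative = {}

    def save(worker):
        tentative[worker.name] = worker.snapshot()

    # Phase 1: each worker's tentative checkpoint is taken in a background thread.
    threads = [threading.Thread(target=save, args=(w,)) for w in workers]
    for t in threads:
        t.start()

    # The application keeps executing while the snapshots are being taken.
    for w in workers:
        w.compute()

    for t in threads:
        t.join()

    # Phase 2: commit only if every worker contributed, so the stored
    # checkpoint always covers the whole application.
    if len(tentative) == len(workers):
        stable_storage[version] = tentative
        return True
    return False
```

Because all processes commit one checkpoint together, only the latest committed version needs to be retained for recovery, which is the usual reason coordinated schemes place low demands on storage.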