Kernel support for zero-loss Internet service restart

Authors:
Da-Wei Chang;Chuan-Ming Tsai;Wei-Kou Li;Tzu-Rung Lee
Affiliations:
Department of Computer Science and Information Engineering, National Cheng-Kung University, Tainan, Taiwan;Department of Computer Science, National Chiao-Tung University, HsinChu, Taiwan;Department of Computer Science, National Chiao-Tung University, HsinChu, Taiwan;Department of Computer Science, National Chiao-Tung University, HsinChu, Taiwan
Venue:
Software—Practice & Experience
Year:
2007

Abstract

Because they run for long periods and serve large numbers of clients, Internet services are prone to transient faults. Restarting a service clears such faults, but the restart discards information about in-progress requests, which is unacceptable for many commercial or transaction-based services. In this paper, we propose an approach that achieves zero-loss restart for Internet services. Under this approach, a kernel subsystem is responsible for detecting transient faults, retaining the I/O channels of the service, and managing the service restart flow. The service itself requires some straightforward modifications to take advantage of the kernel support. To demonstrate the feasibility of our approach, we implemented the subsystem in the Linux kernel and modified a Web server and a CGI program to use it. According to the experimental results, our approach incurs little runtime overhead (less than 3.2%). When the service crashes, it can be restarted quickly (within 210 μs) with no information loss, and the performance impact of the crash is small. These results show that the approach can efficiently achieve zero-loss restart for Internet services. Copyright © 2006 John Wiley & Sons, Ltd.
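
The core idea in the abstract, that the I/O channel is owned by something that outlives the crashed service, can be illustrated with a small user-space analogy. This is only a sketch under stated assumptions, not the authors' Linux kernel subsystem: `ChannelKeeper`, `flaky_service_factory`, and `demo` are hypothetical names invented here, a `RuntimeError` stands in for a transient fault, and a local socket pair stands in for a client connection.

```python
import socket


class ChannelKeeper:
    """Hypothetical stand-in for the kernel subsystem described in the abstract.

    It owns the server end of the connection, so the channel and the pending
    request survive when a service instance fails.
    """

    def __init__(self, conn, max_restarts=3):
        self.conn = conn
        self.max_restarts = max_restarts
        self.restarts = 0

    def serve_request(self, make_service):
        # The keeper, not the service, reads the request off the channel,
        # so a crash inside the service cannot lose it.
        request = self.conn.recv(1024)
        service = make_service()
        while True:
            try:
                reply = service(request)
                break
            except RuntimeError:
                # Simulated transient fault: replace the service instance
                # and replay the retained request on the same channel.
                if self.restarts >= self.max_restarts:
                    raise
                self.restarts += 1
                service = make_service()
        self.conn.sendall(reply)


def flaky_service_factory():
    """Return a factory whose first service instance crashes (assumption)."""
    created = {"count": 0}

    def make_service():
        created["count"] += 1
        is_first = created["count"] == 1

        def service(request):
            if is_first:
                raise RuntimeError("simulated transient fault")
            return b"handled:" + request

        return service

    return make_service


def demo():
    # A connected socket pair plays the role of one client connection.
    client, server_end = socket.socketpair()
    keeper = ChannelKeeper(server_end)
    try:
        client.sendall(b"GET /")
        keeper.serve_request(flaky_service_factory())
        reply = client.recv(1024)
    finally:
        client.close()
        server_end.close()
    return reply, keeper.restarts


if __name__ == "__main__":
    print(demo())
```

In this sketch the client still receives a reply even though the first service instance failed, because the request and the connection were held outside the failing component. The approach in the paper places that responsibility in the kernel, together with fault detection and management of the restart flow.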