Sheaved memory: architectural support for state saving and restoration in pages systems
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Real-time, concurrent checkpoint for parallel programs
PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Continuous checkpointing: joining the checkpointing with virtual memory paging
Software—Practice & Experience
Realizing fault resilience in Web-server cluster
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Migratory TCP: Connection Migration for Service Continuity in the Internet
ICDCS '02 Proceedings of the 22 nd International Conference on Distributed Computing Systems (ICDCS'02)
Checkpointing and Its Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,
An Overview of Checkpointing in Uniprocessor and DistributedSystems, Focusing on Implementation and Performance
Improving the reliability of commodity operating systems
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Improving Logging and Recovery Performance in Phoenix/App
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Improving availability with recursive microreboots: a soft-state system case study
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
Undo for operators: building an undoable e-mail store
ATEC '03 Proceedings of the annual conference on USENIX Annual Technical Conference
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
System support for scalable and fault tolerant internet services
Middleware '98 Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing
Hi-index | 0.00 |
Owing to long serving time and huge numbers of clients, Internet services can easily suffer from transient faults. Although restarting a service can solve this problem, information of the on-line requests will be lost owing to the service restart, which is unacceptable for many commercial or transaction-based services. In this paper, we propose an approach to achieve the goal of zero-loss restart for Internet services. Under this approach, a kernel subsystem is responsible for detecting the transient faults, retaining the I/O channels of the service, and managing the service restart flow. In addition, some straightforward modifications to the service should be made to take advantage of the kernel support. To demonstrate the feasibility of our approach, we implemented the subsystem in the Linux kernel. Moreover, we modified a Web server and a CGI program to take advantage of the kernel support. According to the experimental results, our approach incurs little runtime overhead (i.e. less than 3.2%). When the service crashes, it can be restarted quickly (i.e. within 210 μs) with no information loss. Furthermore, the performance impact due to the service crash is small. These results show that the approach can efficiently achieve the goal of zero-loss restart for Internet services. Copyright © 2006 John Wiley & Sons, Ltd.