We introduce the ROC-1 hardware platform, a large-scale cluster system designed to provide high availability for Internet service applications. The ROC-1 prototype embodies our philosophy of Recovery-Oriented Computing (ROC) by emphasizing detection and recovery from the failures that inevitably occur in Internet service environments, rather than simple avoidance of such failures. ROC-1 promises greater availability than existing server systems by incorporating four techniques applied from the ground up to both hardware and software: redundancy and isolation, online self-testing and verification, support for problem diagnosis, and concern for human interaction with the system.