ROC-1: Hardware Support for Recovery-Oriented Computing

Authors:
David Oppenheimer;Aaron Brown;James Beck;Daniel Hettena;Jon Kuroda;Noah Treuhaft;David A. Patterson;Katherine Yelick
Affiliations:
Univ. of California, Berkeley;Univ. of California, Berkeley;Univ. of California, Berkeley;Univ. of California, Berkeley;Univ. of California, Berkeley;Univ. of California, Berkeley;Univ. of California, Berkeley;Univ. of California, Berkeley
Venue:
IEEE Transactions on Computers - Special issue on fault-tolerant embedded systems
Year:
2002

Abstract

We introduce the ROC-1 hardware platform, a large-scale cluster system designed to provide high availability for Internet service applications. The ROC-1 prototype embodies our philosophy of Recovery-Oriented Computing (ROC) by emphasizing detection of, and recovery from, the failures that inevitably occur in Internet service environments, rather than simple avoidance of such failures. ROC-1 promises greater availability than existing server systems by incorporating four techniques applied from the ground up to both hardware and software: redundancy and isolation, online self-testing and verification, support for problem diagnosis, and concern for human interaction with the system.