In this paper, we extend a previously published approach to error recovery in enterprise storage controllers with multi-core processors. The approach first partitions the tasks in the controller software's runtime into clusters of dependent tasks, called recovery scopes. These recovery scopes are then mapped onto a set of recovery groups, which form the basis for scheduling tasks both during recovery and during normal operation. This recovery-aware scheduling (RAS) replaces the storage controller's performance-based scheduling. Through simulation and benchmark experiments, we find that: 1) the performance of RAS appears to depend critically on the values of recovery-related parameters; and 2) our fine-grained recovery approach promises to enhance storage system availability while keeping the additional overhead, and the resulting performance degradation, under control.
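The two steps described above (partitioning tasks into recovery scopes, then mapping scopes onto recovery groups) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes that a recovery scope is a connected component of a task dependency graph, and it assumes a simple greedy, size-balanced mapping of scopes to groups. The function names, task names, and dependency pairs are all hypothetical.

```python
from collections import defaultdict


def recovery_scopes(tasks, dependencies):
    """Cluster tasks into recovery scopes.

    Assumption: a scope is a connected component of the (undirected)
    task dependency graph, computed here with union-find.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Follow parent links to the root, halving the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in dependencies:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    scopes = defaultdict(list)
    for t in tasks:
        scopes[find(t)].append(t)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in scopes.values())


def recovery_groups(scopes, num_groups):
    """Map scopes onto recovery groups.

    Assumption: largest scope first, each placed on the group that
    currently holds the fewest tasks (a greedy balancing heuristic).
    """
    groups = [[] for _ in range(num_groups)]
    loads = [0] * num_groups
    for scope in sorted(scopes, key=len, reverse=True):
        target = loads.index(min(loads))
        groups[target].append(scope)
        loads[target] += len(scope)
    return groups


# Hypothetical controller tasks and dependencies, for illustration only.
tasks = ["cache_flush", "destage", "io_dispatch", "raid_rebuild", "scrub"]
dependencies = [("cache_flush", "destage"), ("raid_rebuild", "scrub")]

scopes = recovery_scopes(tasks, dependencies)
groups = recovery_groups(scopes, num_groups=2)
print(scopes)
print(groups)
```

A scheduler built on such a mapping could then limit how many tasks from one recovery group run concurrently, so that a failure and the recovery of one scope do not occupy every core. That policy and its parameters are left out here because the abstract does not specify them.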