ACM Transactions on Database Systems (TODS)
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Transaction Processing: Concepts and Techniques
Transaction Processing: Concepts and Techniques
Software Dependability in the Tandem GUARDIAN System
IEEE Transactions on Software Engineering
System structure for software fault tolerance
Proceedings of the international conference on Reliable software
Software Rejuvenation: Analysis, Module and Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Improving availability with recursive microreboots: a soft-state system case study
Performance Evaluation - Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
IBM TotalStorage Enterprise Storage Server: A designer's view
IBM Systems Journal
Improving storage system availability with D-GRAID
ACM Transactions on Storage (TOS)
Rx: treating bugs as allergies---a safe method to survive software failures
Proceedings of the twentieth ACM symposium on Operating systems principles
Triage: Performance differentiation for storage systems using adaptive control
ACM Transactions on Storage (TOS)
Storage performance virtualization via throughput and latency control
ACM Transactions on Storage (TOS)
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
pClock: an arrival curve based approach for QoS guarantees in shared storage systems
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Using Rescue Points to Navigate Software Recovery
SP '07 Proceedings of the 2007 IEEE Symposium on Security and Privacy
A systematic approach to system state restoration during storage controller micro-recovery
FAST '09 Proccedings of the 7th conference on File and storage technologies
IBM Journal of Research and Development
Hi-index | 0.01 |
In this paper we develop a recovery conscious framework for multi-core architectures and a suite of techniques for improving the resiliency and recovery efficiency of highly concurrent embedded storage software systems. Our techniques aim at providing continuous availability and performance during recovery while minimizing the time to recovery and the need for rearchitecting the system (legacy code). The main contributions of our recovery conscious framework include (1) a task-level recovery model, which consists of mechanisms for classifying storage tasks into recovery groups and dividing the overall system resources into recovery-oriented resource pools, and (2) the development of recovery-conscious scheduling, which enforces some serializability of failure-dependent tasks in order to reduce the ripple effect of software failure and improve the availability of the system. We present three alternative recovery-conscious scheduling algorithms; each represents one way to trade-off between recovery time and system performance. We have implemented and evaluated these recovery-conscious scheduling algorithms on a real industry-standard storage system. Our experimental evaluation results show that the proposed recovery conscious scheduling algorithms are non-intrusive and can significantly improve (throughput by 16.3% and response time by 22.9%) the performance of the system during failure recovery.