Enhancing storage system availability on multi-core architectures with recovery-conscious scheduling

Authors:
Sangeetha Seshadri;Lawrence Chiu;Cornel Constantinescu;Subashini Balachandran;Clem Dickey;Ling Liu;Paul Muench
Affiliations:
Georgia Institute of Technology, GA;IBM Almaden Research Center, CA;IBM Almaden Research Center, CA;IBM Almaden Research Center, CA;IBM Almaden Research Center, CA;Georgia Institute of Technology, GA;IBM Almaden Research Center, CA
Venue:
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Year:
2008


Abstract

In this paper we develop a recovery-conscious framework for multi-core architectures and a suite of techniques for improving the resiliency and recovery efficiency of highly concurrent embedded storage software systems. Our techniques aim to provide continuous availability and performance during recovery while minimizing both the time to recovery and the need to rearchitect the system (legacy code). The main contributions of our recovery-conscious framework are (1) a task-level recovery model, which consists of mechanisms for classifying storage tasks into recovery groups and dividing the overall system resources into recovery-oriented resource pools, and (2) recovery-conscious scheduling, which enforces some serializability of failure-dependent tasks in order to reduce the ripple effect of software failures and improve the availability of the system. We present three alternative recovery-conscious scheduling algorithms, each representing one way to trade off recovery time against system performance. We have implemented and evaluated these algorithms on a real industry-standard storage system. Our experimental results show that the proposed recovery-conscious scheduling algorithms are non-intrusive and can significantly improve system performance during failure recovery (throughput by 16.3% and response time by 22.9%).
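
The idea of grouping tasks by recovery dependency and limiting how many tasks from one group run at once can be illustrated with a small sketch. This is not any of the three algorithms from the paper, whose details the abstract does not give. It is a minimal simulation under stated assumptions: the names `Task`, `schedule_rounds`, `num_workers`, and `max_per_group`, and the group labels `cache` and `raid`, are all hypothetical, and the "rounds" model is a simplification of a real multi-core dispatcher.

```python
from collections import Counter, deque
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    group: str  # hypothetical recovery-group label (assumption, not from the paper)


def schedule_rounds(tasks, num_workers, max_per_group):
    """Split tasks into rounds of at most num_workers concurrent tasks,
    allowing at most max_per_group tasks from one recovery group per round."""
    if num_workers < 1 or max_per_group < 1:
        raise ValueError("num_workers and max_per_group must be >= 1")
    pending = deque(tasks)
    rounds = []
    while pending:
        running = []
        in_flight = Counter()  # tasks per recovery group in this round
        deferred = deque()     # tasks skipped because their group hit the cap
        while pending and len(running) < num_workers:
            task = pending.popleft()
            if in_flight[task.group] < max_per_group:
                running.append(task)
                in_flight[task.group] += 1
            else:
                deferred.append(task)
        # Put deferred tasks back at the front, keeping their original order.
        pending.extendleft(reversed(deferred))
        rounds.append(running)
    return rounds


tasks = [
    Task("A1", "cache"), Task("A2", "cache"), Task("A3", "cache"),
    Task("B1", "raid"), Task("B2", "raid"),
]
rounds = schedule_rounds(tasks, num_workers=4, max_per_group=1)
for i, running in enumerate(rounds, start=1):
    print(i, [t.name for t in running])
```

With a cap of 1, tasks from the same group are fully serialized, so a failure in one group can tie up at most one worker at a time while the others keep serving different groups. Raising `max_per_group` fills more workers per round at the price of exposing more of them to a single group's failure, which is one way to picture the trade-off between recovery time and performance that the abstract mentions.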