A systematic approach to system state restoration during storage controller micro-recovery

Authors:
Sangeetha Seshadri;Lawrence Chiu;Ling Liu
Affiliations:
Georgia Institute of Technology;IBM Almaden Research Center;Georgia Institute of Technology
Venue:
FAST '09 Proceedings of the 7th conference on File and storage technologies
Year:
2009

Abstract

Micro-recovery, or failure recovery at a fine granularity, is a promising approach to improve the recovery time of software for modern storage systems. Instead of stalling the whole system during failure recovery, micro-recovery can facilitate recovery by a single thread while the system continues to run. A key challenge in performing micro-recovery is to be able to perform efficient and effective state restoration while accounting for dynamic dependencies between multiple threads in a highly concurrent environment. We present Log(Lock), a practical and flexible architecture for performing state restoration without re-architecting legacy code. We formally model thread dependencies based on accesses to both shared state and resources. The Log(Lock) execution model tracks dependencies at runtime and captures the failure context through the restoration level. We develop restoration protocols based on recovery points and restoration levels that identify when micro-recovery is possible and the recovery actions that need to be performed for a given failure context. We have implemented Log(Lock) in a real enterprise storage controller. Our experimental evaluation shows that Log(Lock)-enabled micro-recovery is efficient. It imposes