Users are storing increasingly massive amounts of data, storage software is growing more complex, and cheap, less reliable hardware is increasingly common. Together, these trends present a formidable challenge: how can we promise users that storage systems work robustly in spite of the complex failures that can arise?
In the first part of this dissertation, we address this question by analyzing three reliability components present in many modern file systems: the file system checker (fsck), failure detection and recovery policies (failure policy), and journaling. We find that these subsystems are deficient in handling partial disk failures. In the fsck analysis, we find that some repairs are buggy (leaving the repaired file system more corrupted than before) and some repairs are missing (leaving some corruptions unrepaired). In the failure policy analysis, we observe a major problem of diffused fault handling, which makes policies inconsistent, buggy, and inflexible to change. In the journaling analysis, we find that current journaling frameworks cannot recover from checkpoint write failures, and hence such write failures are intentionally ignored. These results suggest that managing failures is hard (as also hinted by developers' comments) and call for novel solutions for building more reliable storage systems.
In the second part of this dissertation, we present our solutions to the problems above. First, we re-architect the file system checker by introducing SQCK, a robust file system checker that employs a declarative query language. By writing hundreds of checks and repairs in a query language (e.g., SQL), the high-level intent of the checker can be specified in a clear and compact manner. We show that SQCK provides the same functionality as the Linux ext2/3 checker using elegant and compact queries.
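The idea of expressing checks and repairs as declarative queries can be illustrated with a small sketch. This is not SQCK's code or schema: the `inode` and `dirent` tables, their columns, and the two checks below are hypothetical simplifications, and Python's standard `sqlite3` module stands in for whatever query engine the checker actually uses.

```python
import sqlite3

def build_image(conn):
    # Hypothetical, heavily simplified metadata tables (not SQCK's schema).
    conn.executescript("""
        CREATE TABLE inode (ino INTEGER PRIMARY KEY, allocated INTEGER, links INTEGER);
        CREATE TABLE dirent (parent INTEGER, name TEXT, ino INTEGER);
    """)
    conn.executemany("INSERT INTO inode VALUES (?, ?, ?)",
                     [(11, 1, 1), (12, 0, 0), (13, 1, 3)])
    conn.executemany("INSERT INTO dirent VALUES (?, ?, ?)",
                     [(2, "a", 11), (2, "b", 12), (2, "c", 13)])

# Check: directory entries that point at an unallocated inode.
CHECK_DANGLING = """
    SELECT d.parent, d.name, d.ino
    FROM dirent d JOIN inode i ON i.ino = d.ino
    WHERE i.allocated = 0
"""
# Repair: drop those entries.
REPAIR_DANGLING = """
    DELETE FROM dirent
    WHERE ino IN (SELECT ino FROM inode WHERE allocated = 0)
"""
# Check: allocated inodes whose link count disagrees with the directory tree.
CHECK_LINKS = """
    SELECT i.ino, i.links, COUNT(d.ino) AS actual
    FROM inode i LEFT JOIN dirent d ON d.ino = i.ino
    WHERE i.allocated = 1
    GROUP BY i.ino
    HAVING i.links <> COUNT(d.ino)
"""
# Repair: reset the link count to the number of referencing entries.
REPAIR_LINKS = """
    UPDATE inode
    SET links = (SELECT COUNT(*) FROM dirent d WHERE d.ino = inode.ino)
    WHERE allocated = 1
"""

def run_checker(conn):
    """Run every check, record what it found, then apply the repairs."""
    findings = {
        "dangling": conn.execute(CHECK_DANGLING).fetchall(),
        "bad_links": conn.execute(CHECK_LINKS).fetchall(),
    }
    conn.execute(REPAIR_DANGLING)
    conn.execute(REPAIR_LINKS)
    conn.commit()
    return findings

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_image(conn)
    print(run_checker(conn))
```

Each check is a single query whose intent is readable on its own, which is the property the abstract attributes to a declarative checker; a real checker would need many more tables and a defined ordering between repairs.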
Second, we present EDP, a static analysis tool that shows how error codes flow through file systems and storage drivers. We observe that low-level errors are sometimes lost as they travel through the many layers of the storage subsystem: of the 9022 function calls through which the analyzed error codes propagate, 1153 calls (13%) do not correctly save the propagated error codes. Our detailed analysis shows that many violations are not corner-case mistakes; the return codes of some functions are consistently ignored.
Finally, we present I/O shepherding, a new reliability infrastructure for file systems. With I/O shepherding, the reliability policies of a file system are well-defined, easy to understand, and simple to tailor to the environment and workload. As part of this framework, we also introduce chained transactions, a novel and more powerful transactional model for checkpoint recoveries. We show that I/O shepherding enables simple, powerful, and correctly implemented reliability policies by implementing an increasingly complex set of policies.
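The kind of violation the error-propagation analysis counts, a call whose error code is never saved, can be shown with a toy detector. This is only an illustration under stated assumptions, not EDP: it inspects Python source with the standard `ast` module, whereas the tool described above analyzes file system and driver code, and the function name `write_block` and the `SAMPLE` snippet are hypothetical.

```python
import ast

def find_dropped_errors(source, error_funcs):
    """Return (line, function name) for calls whose return value is discarded."""
    dropped = []
    for node in ast.walk(ast.parse(source)):
        # An expression statement consisting only of a call throws the result away.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Name) and func.id in error_funcs:
                dropped.append((node.lineno, func.id))
    return sorted(dropped)

# Hypothetical code under analysis: write_block is assumed to return an error code.
SAMPLE = '''
def sync_inode(inode):
    err = write_block(inode)      # saved and propagated
    if err:
        return err
    write_block(inode.bitmap)     # return code silently dropped
    return 0
'''

if __name__ == "__main__":
    print(find_dropped_errors(SAMPLE, {"write_block"}))
```

A real analysis also has to follow error codes that are saved but later overwritten or never checked, which this sketch does not attempt. For scale, 1153 of 9022 calls is about 12.8%, which the abstract rounds to 13%.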