A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Petal: distributed virtual disks
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
OceanStore: an architecture for global-scale persistent storage
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Raid Book: A Source Book for Raid Technology
Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
FAB: building distributed enterprise disk arrays from commodity components
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Note: Correction to the 1997 tutorial on Reed–Solomon coding
Software—Practice & Experience - Research Articles
IBM intelligent Bricks project: petabytes and beyond
IBM Journal of Research and Development
WEAVER codes: highly fault tolerant erasure codes for storage systems
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Flipstone: managing storage with fail-in-place and deferred maintenance service models
ACM SIGOPS Operating Systems Review
Higher reliability redundant disk arrays: Organization, operation, and coding
ACM Transactions on Storage (TOS)
Hierarchical RAID: Design, performance, reliability, and recovery
Journal of Parallel and Distributed Computing
A key objective of the IBM Intelligent Bricks project is to create a highly reliable system from commodity components. We envision such systems being architected for a service model called fail-in-place, or deferred maintenance. By delaying service actions, possibly for the entire lifetime of the system, management of the system is simplified. This paper examines the hardware reliability and deferred maintenance of intelligent storage brick (ISB) systems, assuming a mesh-connected collection of bricks in which each brick includes processing power, memory, networking, and storage. On the basis of Monte Carlo simulations, we quantify the fraction of bricks that become unusable to a distributed data redundancy scheme owing to degraded internal bandwidth and loss of external host connectivity. We derive a system hardware reliability expression and predict the length of time ISB systems can operate without replacement of failed bricks. We also show, via a Markov analysis, the level of fault tolerance that the data redundancy scheme requires to achieve a goal of fewer than two data loss events per exabyte-year due to multiple failures.
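The kind of Monte Carlo estimate described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual model: brick lifetimes are assumed exponential, bricks are grouped into fixed-size redundancy groups that each tolerate a given number of failures, and there is no repair (fail-in-place). All parameter values (brick count, group size, MTTF, mission time) are hypothetical placeholders.

```python
import random

def simulate_isb(n_bricks=64, group_size=8, tolerated=2,
                 mttf_hours=1e5, mission_hours=5 * 8760,
                 trials=10000, seed=1):
    """Estimate the probability that a fail-in-place brick system
    suffers a data loss event within its mission time.

    Assumptions (illustrative, not from the paper): exponential
    brick lifetimes, fixed redundancy groups of `group_size` bricks
    each tolerating `tolerated` concurrent failures, and no repair,
    since fail-in-place means failed bricks are never replaced.
    """
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        # Draw an exponential failure time for every brick.
        fails = [rng.expovariate(1.0 / mttf_hours) for _ in range(n_bricks)]
        # Data is lost if any redundancy group accumulates more than
        # `tolerated` failed bricks before the mission ends.
        for g in range(0, n_bricks, group_size):
            failed = sum(1 for t in fails[g:g + group_size]
                         if t < mission_hours)
            if failed > tolerated:
                losses += 1
                break
    return losses / trials
```

Dividing the resulting probability by the system's capacity and mission length would give a rate comparable to the paper's "data loss events per exabyte-year" target; raising `tolerated` (stronger redundancy) or shortening the mission drives the estimate down.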