A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Petal: distributed virtual disks
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
OceanStore: an architecture for global-scale persistent storage
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Raid Book: A Source Book for Raid Technology
Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design
FAB: building distributed enterprise disk arrays from commodity components
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Note: Correction to the 1997 tutorial on Reed–Solomon coding
Software—Practice & Experience - Research Articles
IBM intelligent Bricks project: petabytes and beyond
IBM Journal of Research and Development
WEAVER codes: highly fault tolerant erasure codes for storage systems
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Flipstone: managing storage with fail-in-place and deferred maintenance service models
ACM SIGOPS Operating Systems Review
Higher reliability redundant disk arrays: Organization, operation, and coding
ACM Transactions on Storage (TOS)
Hierarchical RAID: Design, performance, reliability, and recovery
Journal of Parallel and Distributed Computing
A key objective of the IBM Intelligent Bricks project is to create a highly reliable system from commodity components. We envision such systems being architected for a service model called fail-in-place, or deferred maintenance. By delaying service actions, possibly for the entire lifetime of the system, management of the system is simplified. This paper examines the hardware reliability and deferred maintenance of intelligent storage brick (ISB) systems, assuming a mesh-connected collection of bricks in which each brick includes processing power, memory, networking, and storage. On the basis of Monte Carlo simulations, we quantify the fraction of bricks that become unusable to a distributed data redundancy scheme owing to degraded internal bandwidth and loss of external host connectivity. We derive a system hardware reliability expression and predict the length of time ISB systems can operate without replacement of failed bricks. We also show, via a Markov analysis, the level of fault tolerance that the data redundancy scheme requires to achieve a goal of fewer than two data loss events per exabyte-year due to multiple failures.
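The kind of Monte Carlo estimate described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual model: brick lifetimes are assumed exponential, bricks are grouped into fixed-size redundancy groups that each tolerate a given number of failures, and there is no repair (fail-in-place). All parameter values (brick count, group size, MTTF, mission time) are hypothetical placeholders.

```python
import random

def simulate_isb(n_bricks=64, group_size=8, tolerated=2,
                 mttf_hours=1e5, mission_hours=5 * 8760,
                 trials=10000, seed=1):
    """Estimate the probability that a fail-in-place brick system
    suffers a data loss event within its mission time.

    Assumptions (illustrative, not from the paper): exponential
    brick lifetimes, fixed redundancy groups of `group_size` bricks
    each tolerating `tolerated` concurrent failures, and no repair,
    since fail-in-place means failed bricks are never replaced.
    """
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        # Draw an exponential failure time for every brick.
        fails = [rng.expovariate(1.0 / mttf_hours) for _ in range(n_bricks)]
        # Data is lost if any redundancy group accumulates more than
        # `tolerated` failed bricks before the mission ends.
        for g in range(0, n_bricks, group_size):
            failed = sum(1 for t in fails[g:g + group_size]
                         if t < mission_hours)
            if failed > tolerated:
                losses += 1
                break
    return losses / trials
```

Dividing the resulting probability by the system's capacity and mission length would give a rate comparable to the paper's "data loss events per exabyte-year" target; raising `tolerated` (stronger redundancy) or shortening the mission drives the estimate down.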