Availability analysis of blade server systems

Authors:
W. E. Smith; K. S. Trivedi; L. A. Tomek; J. Ackaret
Affiliations:
IBM Systems and Technology Group, NC; Department of Electrical and Computer Engineering, Pratt School of Engineering, Duke University, Durham, NC; IBM Systems and Technology Group, NC; IBM Systems and Technology Group, Beaverton, OR
Venue:
IBM Systems Journal
Year:
2008

Abstract

The successful development and marketing of commercial high-availability systems requires the ability to evaluate system availability. Specifically, one should be able to demonstrate that projected customer requirements are met, identify availability bottlenecks, and evaluate and compare different configurations and designs. For evaluation approaches based on analytic modeling, these systems are often so complex that state-space methods are ineffective because of the large number of states, whereas combinatorial methods are inadequate for capturing all significant dependencies. The two-level hierarchical decomposition proposed here is suitable for the availability modeling of blade server systems such as IBM BladeCenter®, a commercial, high-availability multicomponent system comprising up to 14 separate blade servers contained within a chassis that provides shared subsystems such as power and cooling. The approach is based on an availability model that combines a high-level fault tree model with a number of lower-level Markov models. It is used to determine component-level contributions to downtime as well as steady-state availability for both standalone and clustered blade servers. Sensitivity of the results to input parameters is examined, extensions to the models are described, and availability bottlenecks and possible solutions are identified.
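
As an illustration of the two-level idea described in the abstract, the following minimal Python sketch solves small Markov submodels in closed form and feeds their steady-state availabilities into a top-level fault tree. It is not the authors' model: the subsystem structure, the function names, and all failure and repair rates are hypothetical values chosen only to show the mechanics, and the top level assumes the subsystems are statistically independent.

```python
# Illustrative sketch only: hypothetical structure and rates, not the paper's model.

MINUTES_PER_YEAR = 365 * 24 * 60


def two_state_availability(failure_rate, repair_rate):
    """Lower level: steady-state availability of an up/down Markov chain."""
    return repair_rate / (failure_rate + repair_rate)


def redundant_pair_availability(failure_rate, repair_rate):
    """Lower level: two redundant units sharing a single repair crew.

    States are 2, 1, or 0 units up. The shared repair crew is a dependency
    that a purely combinatorial model cannot express. The chain is a
    birth-death process, so the unnormalized steady-state weights follow
    from the balance equations.
    """
    ratio = failure_rate / repair_rate
    weight_two_up = 1.0
    weight_one_up = 2.0 * ratio          # 2*lambda out of "2 up", mu back
    weight_none_up = 2.0 * ratio * ratio  # lambda out of "1 up", mu back
    total = weight_two_up + weight_one_up + weight_none_up
    return (weight_two_up + weight_one_up) / total


def series_availability(*availabilities):
    """Upper level: OR gate on failures (all subsystems must be up)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability):
    """Convert steady-state availability to expected yearly downtime."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical rates per hour (assumptions, not measured data).
    subsystems = {
        "blade": two_state_availability(1 / 50_000, 1 / 4),
        "power": redundant_pair_availability(1 / 100_000, 1 / 4),
        "cooling": redundant_pair_availability(1 / 80_000, 1 / 4),
    }
    system = series_availability(*subsystems.values())
    for name, availability in subsystems.items():
        print(name, downtime_minutes_per_year(availability))
    print("system", system, downtime_minutes_per_year(system))
```

Printing the per-subsystem downtime alongside the system total mirrors the component-level downtime breakdown mentioned in the abstract; rerunning the sketch with perturbed rates gives a crude form of the sensitivity analysis.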