This paper presents our analysis of the failure behavior of large-scale systems, using the failure logs collected by Los Alamos National Laboratory on 22 of their computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering in which nodes are selected incrementally (one by one) for duplication, so as to achieve a target MTTF for the system while duplicating the fewest nodes. We derived a model for the fault coverage provided by duplicating each node and ordered the nodes according to that coverage. Compared to the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
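The incremental selection described above can be illustrated with a minimal sketch. This is not the paper's coverage model, which the abstract does not spell out. It rests on simplifying assumptions made only for illustration: each node fails independently at a constant rate, any node failure brings the system down (so the system failure rate is the sum of node rates and MTTF is its reciprocal), and duplicating a node masks a fixed fraction `coverage` of that node's failures. The function names, the `coverage` parameter, and the sample rates are all hypothetical.

```python
import math


def system_mttf(total_rate):
    """MTTF of a series system with the given total failure rate (assumed model)."""
    return math.inf if total_rate <= 0 else 1.0 / total_rate


def order_nodes_for_duplication(failure_rates, target_mttf, coverage=0.9):
    """Greedily pick nodes to duplicate until the target MTTF is reached.

    failure_rates: dict mapping node id -> failures per hour (hypothetical input).
    coverage: assumed fraction of a node's failures masked by its duplicate.
    Returns (nodes chosen in order, resulting system MTTF).
    """
    total_rate = sum(failure_rates.values())
    # Under these assumptions, the node with the highest failure rate
    # removes the most system failure rate when duplicated, so rank by rate.
    ranked = sorted(failure_rates, key=failure_rates.get, reverse=True)

    chosen = []
    for node in ranked:
        if system_mttf(total_rate) >= target_mttf:
            break
        total_rate -= coverage * failure_rates[node]
        chosen.append(node)
    return chosen, system_mttf(total_rate)


if __name__ == "__main__":
    # Made-up per-node failure rates (failures per hour), for illustration only.
    rates = {"n0": 0.010, "n1": 0.002, "n2": 0.001, "n3": 0.007}
    nodes, mttf = order_nodes_for_duplication(rates, target_mttf=100.0)
    print(nodes, round(mttf, 1))
```

Ranking by failure rate is the simplest stand-in for "coverage provided by each node". A random ordering, the baseline the abstract compares against, would typically need more duplicated nodes to reach the same target when failure rates are skewed across nodes.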