Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation

  • Authors:
  • Zbigniew Kalbarczyk;Ravishankar K. Iyer;Gregory L. Ries;Jagdish U. Patel;Myeong S. Lee;Yuxiao Xiao

  • Affiliations:
  • Univ. of Illinois at Urbana-Champaign, Urbana;Univ. of Illinois at Urbana-Champaign, Urbana;ATI Research Silicon Valley, Inc., Santa Clara, CA;NASA Jet Propulsion Lab, Pasadena, CA;Hewlett-Packard, Cupertino, CA;Strategy, Vienna, VA

  • Venue:
  • IEEE Transactions on Software Engineering
  • Year:
  • 1999

Abstract

This paper presents a hierarchical simulation methodology that enables accurate system evaluation under realistic faults and conditions. In this methodology, the effects of low-level (i.e., transistor- or circuit-level) faults are propagated to higher levels (i.e., the system level) using fault dictionaries. The primary fault models are obtained by simulating the transistor-level effect of a radiation particle penetrating a device. The resulting current bursts constitute the first-level fault dictionary and are used in circuit-level simulation to determine their impact on circuit latches and flip-flops. The latched outputs constitute the next-level fault dictionary in the hierarchy and are used to conduct fault injection simulation at the chip level under selected workloads or application programs. Faults injected at the chip level result in memory corruptions, which form the next-level fault dictionary for the system-level simulation of an application running on simulated hardware. When the application terminates, either normally or abnormally, the overall impact of the fault on software behavior is quantified and analyzed. The system in this sense can be a single workstation or a network. The simulation method is demonstrated and validated in a case study of a network system based on Myrinet, a commercial high-speed network.
The study shows that the method: 1) allows detailed simulation of faults at lower levels and effective fault propagation through the system to the user-visible higher levels using fault dictionaries, 2) links physical faults with effects that the user can observe at the higher levels and thus provides a foundation for realistic fault injection studies, 3) significantly reduces the number of simulations needed, owing to the fault dictionary method, 4) offers high confidence in the evaluation results because the system is analyzed in the presence of realistic fault conditions, and 5) provides valuable feedback for designing error recovery mechanisms to improve dependability.
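To make the fault-dictionary idea concrete, here is a minimal Python sketch, assuming toy stand-ins for each level's simulator. The level models (`transistor_level`, `circuit_level`, `chip_level`, `system_level`), the charge units, the `FaultDictionary` class, and the outcome categories are all hypothetical illustrations, not the authors' simulators. Each level caches the result of simulating a given input fault, so a fault that recurs at one level is simulated only once and its dictionary entry is reused as input to the next level.

```python
"""Toy sketch of hierarchical fault propagation through fault dictionaries.

Every level model here is a HYPOTHETICAL stand-in for a real simulator
(transistor, circuit, chip, system); names and thresholds are illustrative.
"""
from collections import Counter


class FaultDictionary:
    """Caches the outcome of one level's (expensive) simulation per input fault."""

    def __init__(self, name, simulate):
        self.name = name
        self.simulate = simulate  # hypothetical per-fault simulator for this level
        self.entries = {}         # input fault -> resulting next-level fault
        self.simulations = 0      # number of times simulate() actually ran

    def lookup(self, fault):
        # Simulate only faults not already in the dictionary; reuse the rest.
        if fault not in self.entries:
            self.simulations += 1
            self.entries[fault] = self.simulate(fault)
        return self.entries[fault]


def transistor_level(charge):
    # Hypothetical: deposited particle charge -> current-burst class 0..3.
    return min(charge // 10, 3)


def circuit_level(burst):
    # Hypothetical: burst class -> name of the flip-flop that latches an error.
    return None if burst == 0 else f"ff{burst}"


def chip_level(latched):
    # Hypothetical: latched error -> memory corruption (address, bit mask).
    return None if latched is None else ("0x1000", 1 << int(latched[2:]))


def system_level(corruption):
    # Hypothetical: memory corruption -> application outcome.
    if corruption is None:
        return "no effect"
    return "crash" if corruption[1] >= 4 else "silent data corruption"


def build_hierarchy():
    # Lowest level first; each level's output is the next level's input fault.
    return [
        FaultDictionary("transistor", transistor_level),
        FaultDictionary("circuit", circuit_level),
        FaultDictionary("chip", chip_level),
        FaultDictionary("system", system_level),
    ]


def propagate(fault, hierarchy):
    # Feed each level's dictionary entry into the next level.
    for level in hierarchy:
        fault = level.lookup(fault)
    return fault


if __name__ == "__main__":
    hierarchy = build_hierarchy()
    charges = range(0, 40, 2)  # 20 hypothetical particle strikes
    outcomes = Counter(propagate(c, hierarchy) for c in charges)
    used = sum(level.simulations for level in hierarchy)
    print(outcomes)
    print(f"{used} simulations with dictionaries vs "
          f"{len(charges) * len(hierarchy)} without")
```

In this toy run, the 20 strikes collapse into four distinct current-burst classes, so the hierarchy performs 32 level simulations instead of the 80 that simulating every strike at every level would require. The numbers only illustrate the kind of reduction described in point 3 above; they are not the paper's results.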