Towards a hardware fault-injection testbed to support reproducible resiliency experiments

  • Authors:
  • Ron Sass; Rahul R. Sharma; Nathan DeBardeleben

  • Affiliations:
  • University of North Carolina at Charlotte, Charlotte, NC, USA; University of North Carolina at Charlotte, Charlotte, NC, USA; Los Alamos National Laboratory, Los Alamos, NM, USA

  • Venue:
  • Proceedings of the 2009 workshop on Resiliency in high performance
  • Year:
  • 2009

Abstract

As the largest computers in the world continue to scale up, significant new problems emerge. Hard and soft failures, arising from a number of different sources, cause correctness failures and performance degradation in high-component-count machines that simply do not occur in smaller-scale machines. This presents a serious problem to researchers interested in studying and countering these effects, because the effects are not reproducible on smaller-scale testbeds. The Reconfigurable Computing Cluster project is primarily interested in the feasibility of scaling a network of Platform FPGAs to a PetaFLOP. In this architecture, the FPGAs are the basic compute nodes (not accelerators to a microprocessor). Each FPGA node has a processor, bus, memory controller, network adapter, etc. configured into its programmable logic and runs a patched mainline Linux kernel. The cluster runs Open MPI, and virtually any MPI application can be compiled and run. Since the hardware of the nodes is easily reconfigured, this architecture offers an ideal platform for performing fault injection and Heisenberg-free monitoring of parallel programs. This work-in-progress demonstrates the addition of a "bus cycle stealer": a small piece of hardware that can be programmed to wake up periodically and execute a variable number of bus transactions. The effects on performance (single-node and 16-node systems) are reported.