Towards a hardware fault-injection testbed to support reproducible resiliency experiments

Authors:
Ron Sass;Rahul R. Sharma;Nathan DeBardeleben
Affiliations:
University of North Carolina at Charlotte, Charlotte, NC, USA;University of North Carolina at Charlotte, Charlotte, NC, USA;Los Alamos National Laboratory, Los Alamos, NM, USA
Venue:
Proceedings of the 2009 workshop on Resiliency in high performance
Year:
2009

Abstract

As the largest computers in the world continue to scale up, significant new problems emerge. Hard and soft failures from a number of different sources cause correctness failures and performance degradation in high-component-count machines that simply do not occur in smaller-scale machines. This presents a serious problem to researchers interested in studying and countering these effects, because the effects are not reproducible on smaller-scale testbeds. The Reconfigurable Computing Cluster project is primarily interested in the feasibility of scaling a network of Platform FPGAs to a PetaFLOP. In this architecture, the FPGAs are the basic compute nodes (not accelerators to a microprocessor). Each FPGA node has a processor, bus, memory controller, network adapter, etc. configured into its programmable logic and runs a patched mainline Linux kernel. The cluster runs Open MPI, and virtually any MPI application can be compiled and run. Since the hardware of the nodes is easily reconfigured, it was realized that this architecture offers an ideal platform for performing fault injections and Heisenberg-free monitoring of parallel programs. This work-in-progress demonstrates the addition of a "bus cycle stealer." It is a small piece of hardware that can be programmed to wake up periodically and execute a variable number of bus transactions. The effect on performance (single-node and 16-node systems) is reported.
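
To make the "bus cycle stealer" idea concrete, the following is a minimal back-of-the-envelope sketch in Python. It is not the authors' hardware or their measurement method. The function names, the parameters (period_cycles, transactions, cycles_per_transaction, bus_bound_fraction), and the Amdahl-style slowdown estimate are all assumptions introduced here for illustration.

```python
def stolen_fraction(period_cycles, transactions, cycles_per_transaction):
    """Fraction of bus cycles consumed by a periodic stealer.

    Hypothetical model: every `period_cycles` bus cycles the stealer wakes
    up and issues `transactions` transactions, each occupying the bus for
    `cycles_per_transaction` cycles.
    """
    if period_cycles <= 0:
        raise ValueError("period_cycles must be positive")
    busy = transactions * cycles_per_transaction
    # The stealer cannot occupy more cycles than one period contains.
    return min(busy, period_cycles) / period_cycles


def estimated_slowdown(bus_bound_fraction, stolen):
    """Amdahl-style estimate of application slowdown (an assumption).

    Only the share of runtime that is limited by the bus
    (`bus_bound_fraction`) is stretched by the loss of bus cycles.
    """
    if not 0.0 <= bus_bound_fraction <= 1.0:
        raise ValueError("bus_bound_fraction must be in [0, 1]")
    if stolen >= 1.0:
        return float("inf")  # the bus is never available to the application
    return (1.0 - bus_bound_fraction) + bus_bound_fraction / (1.0 - stolen)


if __name__ == "__main__":
    # Example values chosen purely for illustration.
    f = stolen_fraction(period_cycles=1000, transactions=10,
                        cycles_per_transaction=10)
    print(f"stolen bus fraction: {f:.3f}")
    print(f"estimated slowdown:  {estimated_slowdown(0.5, f):.3f}x")
```

Under this toy model, varying the number of transactions per wake-up sweeps the injected interference from none to full bus saturation. Real behavior would also depend on bus arbitration, caching, and the communication pattern of the MPI application, none of which are modeled here.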