As the largest computers in the world continue to scale up, significant new problems emerge. Hard and soft failures from a number of different sources cause correctness failures and performance degradation in high-component-count machines that simply do not occur in smaller-scale machines. This presents a serious problem to researchers interested in studying and countering these effects, because the effects are not reproducible on smaller-scale testbeds. The Reconfigurable Computing Cluster project is primarily interested in the feasibility of scaling a network of Platform FPGAs to a PetaFLOP. In this architecture, the FPGAs are the basic compute nodes (not accelerators attached to a microprocessor). Each FPGA node has a processor, bus, memory controller, network adapter, etc. configured into its programmable logic and runs a patched mainline Linux kernel. The cluster runs Open MPI, and virtually any MPI application can be compiled and run. Since the hardware of the nodes is easily reconfigured, it was realized that this architecture offers an ideal platform for performing fault injection and Heisenberg-free monitoring of parallel programs. This work-in-progress demonstrates the addition of a "bus cycle stealer": a small piece of hardware that can be programmed to wake up periodically and execute a variable number of bus transactions. The effects on performance for single-node and 16-node systems are reported.
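The behavior of a periodic bus cycle stealer can be illustrated with a toy software model. The sketch below is not the authors' hardware design. It assumes a stealer that wakes at the start of every period and holds the bus for a fixed burst of transactions, and it counts the cycles left for the application. The names `stolen_fraction`, `simulate`, `period_cycles`, `burst_transactions`, and `cycles_per_transaction` are hypothetical and chosen only for illustration.

```python
def stolen_fraction(period_cycles, burst_transactions, cycles_per_transaction=1):
    """Steady-state fraction of bus cycles occupied by the (hypothetical) stealer."""
    if period_cycles <= 0:
        raise ValueError("period_cycles must be positive")
    busy = burst_transactions * cycles_per_transaction
    # The stealer cannot occupy more than the whole period.
    return min(busy, period_cycles) / period_cycles


def simulate(total_cycles, period_cycles, burst_transactions, cycles_per_transaction=1):
    """Cycle-by-cycle toy model: return the bus cycles left for the application."""
    if period_cycles <= 0:
        raise ValueError("period_cycles must be positive")
    busy_per_period = min(burst_transactions * cycles_per_transaction, period_cycles)
    app_cycles = 0
    for cycle in range(total_cycles):
        # Assumption: the stealer wakes at the start of each period and holds
        # the bus for busy_per_period consecutive cycles.
        if cycle % period_cycles >= busy_per_period:
            app_cycles += 1
    return app_cycles


if __name__ == "__main__":
    # Hypothetical settings: wake every 100 cycles, issue 10 single-cycle transactions.
    print(stolen_fraction(100, 10))
    print(simulate(1000, 100, 10))
```

Varying the period and burst length in such a model gives a rough sense of how much bus bandwidth is taken away from the application. Real bus arbitration, caching, and network effects are not captured, so it says nothing about the measured single-node or 16-node results.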