A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors

  • Authors:
  • Jeremy W. Sheaffer; David P. Luebke; Kevin Skadron

  • Affiliations:
  • University of Virginia; NVIDIA Research; University of Virginia

  • Venue:
  • Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
  • Year:
  • 2007

Abstract

General-purpose computation on graphics processors (GPGPU) has evolved rapidly since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), GPU vendors are now putting financial stakes into this once-niche, non-graphics market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating-point capabilities, and Folding@Home has even released a GPU port of its distributed protein-folding client. But for GPGPU to become truly important to the supercomputing community, vendors will have to address reliability concerns that have heretofore been unimportant for graphics processors. We present a hardware redundancy-based approach to reliability for general-purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that recomputes only the erroneous results. Our results show that the technique imposes less than a 1.5× performance penalty and saves energy for GPGPU, yet is completely transparent to general graphics and does not affect the performance of the games that drive the market.
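
The abstract only outlines the mechanism. As a rough illustration of the redundancy-and-selective-recovery idea, the CUDA sketch below runs a stand-in workload twice, compares the two result buffers to detect disagreements, and recomputes only the elements that differ. This is a host-side software analogue written for clarity, not the hardware scheme the paper proposes; the kernel names, the workload, and the mismatch bookkeeping are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in workload; any deterministic kernel would do (assumption for illustration).
__global__ void compute(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i] + 1.0f;
}

// Detection: compare the two redundant result buffers and record disagreeing indices.
__global__ void findMismatches(const float *a, const float *b,
                               int *bad, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && a[i] != b[i]) bad[atomicAdd(count, 1)] = i;
}

// Recovery: recompute only the flagged (erroneous) elements.
__global__ void recompute(const float *in, float *out, const int *bad, int count) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < count) { int i = bad[k]; out[i] = in[i] * in[i] + 1.0f; }
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *outA, *outB; int *bad, *count;
    cudaMallocManaged(&in,   n * sizeof(float));
    cudaMallocManaged(&outA, n * sizeof(float));
    cudaMallocManaged(&outB, n * sizeof(float));
    cudaMallocManaged(&bad,  n * sizeof(int));
    cudaMallocManaged(&count, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    *count = 0;

    compute<<<blocks, threads>>>(in, outA, n);   // primary pass
    compute<<<blocks, threads>>>(in, outB, n);   // redundant pass
    findMismatches<<<blocks, threads>>>(outA, outB, bad, count, n);
    cudaDeviceSynchronize();

    if (*count > 0)                              // recover only the erroneous results
        recompute<<<(*count + threads - 1) / threads, threads>>>(in, outA, bad, *count);
    cudaDeviceSynchronize();

    printf("mismatches detected and recomputed: %d\n", *count);
    return 0;
}
```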