General-purpose computation on graphics processors (GPGPU) has evolved rapidly since the introduction of commodity programmable graphics hardware. With the appearance of computation-oriented GPGPU APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), GPU vendors have begun to put financial stakes into this non-graphics, formerly niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating-point capabilities, and Folding@Home has even released a GPU port of its distributed protein-folding client. But for GPGPU to become truly important to the supercomputing community, vendors will have to address the reliability concerns of graphics processors, which have been unimportant until now. We present a hardware redundancy-based approach to reliability for general-purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that recomputes only the erroneous results. Our results show that the technique imposes less than a 1.5x performance penalty and saves energy for GPGPU, while remaining completely transparent to general graphics and leaving unaffected the performance of the games that drive the market.
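The detect-then-selectively-recompute idea can be illustrated in software. The sketch below is only an analogy, not the hardware mechanism the abstract describes: it runs a kernel twice over the same inputs, compares the two result sets, and recomputes only the elements that disagree. The names `redundant_map` and `make_flaky_square`, the retry limit, and the simulated fault are all hypothetical, introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def redundant_map(
    kernel: Callable[[int], int],
    inputs: Sequence[int],
    max_retries: int = 3,
) -> Tuple[List[int], List[int]]:
    """Run `kernel` redundantly and recompute only mismatched elements.

    Returns (results, recomputed_indices). This is a software analogy
    of redundancy-based detection; the names are hypothetical.
    """
    # Two independent executions over the same inputs.
    first = [kernel(x) for x in inputs]
    second = [kernel(x) for x in inputs]

    # Detection: any index where the two runs disagree is suspect.
    suspect = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
    recomputed = list(suspect)

    # Recovery: redo only the suspect elements until two runs agree.
    results = list(first)
    for _ in range(max_retries):
        if not suspect:
            break
        still_bad = []
        for i in suspect:
            a, b = kernel(inputs[i]), kernel(inputs[i])
            if a == b:
                results[i] = a
            else:
                still_bad.append(i)
        suspect = still_bad
    if suspect:
        raise RuntimeError(f"unrecoverable mismatch at indices {suspect}")
    return results, recomputed


def make_flaky_square(bad_input: int) -> Callable[[int], int]:
    """Build a squaring kernel that returns one wrong value, once,
    for `bad_input` (a simulated transient fault)."""
    state = {"fired": False}

    def kernel(x: int) -> int:
        if x == bad_input and not state["fired"]:
            state["fired"] = True
            return x * x + 1  # simulated corrupted result
        return x * x

    return kernel


if __name__ == "__main__":
    results, recomputed = redundant_map(make_flaky_square(3), [1, 2, 3, 4])
    print(results, recomputed)
```

In this sketch the fault-free elements are computed twice and never revisited, so the cost of recovery scales with the number of mismatches rather than with the size of the input.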