Random number generators: good ones are hard to find
Communications of the ACM
Tutorial on semiconductor memory testing
Journal of Electronic Testing: Theory and Applications - Special issue: on memory testing
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Self-Checking Detection and Diagnosis of Transient, Delay, and Crosstalk Faults Affecting Bus Lines
IEEE Transactions on Computers
IOLTW '01 Proceedings of the Seventh International On-Line Testing Workshop
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware
GH '06 Proceedings of the 21st ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Fast support vector machine training and classification on graphics processors
Proceedings of the 25th international conference on Machine learning
Single-event-upset and alpha-particle emission rate measurement techniques
IBM Journal of Research and Development
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Folding@home: Lessons from eight years of volunteer distributed computing
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Soft error resilient QR factorization for hybrid system with GPGPU
Proceedings of the second workshop on Scalable algorithms for large-scale systems
RISE: improving the streaming processors reliability against soft errors in gpgpus
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
CPU-GPU hybrid bidiagonal reduction with soft error resilience
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Optimization power consumption model of reliability-aware GPU clusters
The Journal of Supercomputing
Hi-index | 0.01 |
Graphics processing units (GPUs) are gaining widespread use in high-performance computing because of their performance advantages relative to CPUs. However, the reliability of GPUs is largely unproven. In particular, current GPUs lack error checking and correcting (ECC) in their memory subsystems. The impact of this design has not been previously measured at a large enough scale to quantify soft error events. We present MemtestG80, our software for assessing memory error rates on NVIDIA graphics cards. Furthermore, we present a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 50,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We show that these errors persist after controlling for over clocking and environmental proxies for temperature, but depend strongly on board architecture.