Soft error resilient QR factorization for hybrid system with GPGPU
Proceedings of the second workshop on Scalable algorithms for large-scale systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Correcting soft errors online in LU factorization
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Commercial graphics processing units (GPUs) prove their attractive, inexpensive in high performance scientific applications. However, a recent research through Folding@home demonstrates that two-thirds of tested GPUs on Folding@home exhibit a detectable, pattern-sensitive rate of memory soft errors for GPGPU. Fault tolerance has been viewed as critical to the effective use of these GPUs. In this paper, we present an on-line GPU error detection, location, and correction method to incorporate fault tolerance into matrix multiplication. The main contribution of the paper is to extend the traditional algorithm-based fault tolerance (ABFT) from offline to online and apply it to matrix multiplication on GPUs. The proposed on-line fault tolerance mechanism detects soft errors in the middle of the computation so that better reliability can be achieved by correcting corrupted computations in time. Experimental results demonstrate that the proposed method is highly efficient.