Soft errors pose a real challenge to applications running on modern hardware, as feature sizes shrink and integration density increases in both processors and memory chips. Soft errors manifest as bit-flips that silently alter data values, and numerical software is particularly sensitive to such corruption. In this paper, we present the design of a bidiagonal reduction algorithm that is resilient to soft errors, and we describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance (ABFT), combined with reverse computation, to detect, locate, and correct soft errors. Tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. Our experiments show that the resilient bidiagonal reduction algorithm adds very little overhead compared to the unprotected code: at matrix size 10110 x 10110, the performance overhead is only 1.085% when one error occurs, and 0.354% when no errors occur.
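To illustrate the checksum-based detection and correction that ABFT relies on, the sketch below applies the classic Huang-Abraham scheme to a plain matrix product rather than to the paper's bidiagonal reduction; the function name, injected-error interface, and tolerance are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def abft_matmul(A, B, inject=None):
    """Multiply A @ B with ABFT checksums; detect, locate, and
    correct a single injected soft error in the result."""
    # Augment A with a row of column sums and B with a column of row sums.
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum matrix
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum matrix
    C = Ac @ Br                                         # full-checksum product

    if inject is not None:                              # simulate a bit-flip
        i, j, delta = inject
        C[i, j] += delta

    # Detection: recompute checksums of the data block and compare
    # against the stored checksum row/column.
    data = C[:-1, :-1]
    col_err = data.sum(axis=0) - C[-1, :-1]
    row_err = data.sum(axis=1) - C[:-1, -1]
    bad_cols = np.flatnonzero(np.abs(col_err) > 1e-8)
    bad_rows = np.flatnonzero(np.abs(row_err) > 1e-8)

    # Location + correction: a single corrupted element shows up as
    # exactly one inconsistent row and one inconsistent column.
    if bad_rows.size == 1 and bad_cols.size == 1:
        i, j = bad_rows[0], bad_cols[0]
        data[i, j] -= row_err[i]                        # undo the flip
    return data
```

A single erroneous element is pinpointed by the intersection of the one inconsistent row checksum and the one inconsistent column checksum, which is what lets ABFT correct it without recomputing the whole product.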