Soft errors pose a real challenge to applications running on modern hardware, as feature sizes shrink and integration density increases in both processors and memory chips. Soft errors manifest as bit-flips that silently alter data values, and numerical software is particularly sensitive to such corruption. In this paper, we present the design of a bidiagonal reduction algorithm that is resilient to soft errors, and we describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance (ABFT), combined with reverse computation, to detect, locate, and correct soft errors. Tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. Our experiments show that the resilient bidiagonal reduction algorithm adds very little overhead compared to the unprotected code: at matrix size 10110 x 10110, the performance overhead is only 1.085% when one error occurs, and 0.354% when no errors occur.
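To illustrate the checksum-based detection and correction that ABFT relies on, the sketch below applies the classic Huang-Abraham scheme to a plain matrix product rather than to the paper's bidiagonal reduction; the function name, injected-error interface, and tolerance are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def abft_matmul(A, B, inject=None):
    """Multiply A @ B with ABFT checksums; detect, locate, and
    correct a single injected soft error in the result."""
    # Augment A with a row of column sums and B with a column of row sums.
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum matrix
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum matrix
    C = Ac @ Br                                         # full-checksum product

    if inject is not None:                              # simulate a bit-flip
        i, j, delta = inject
        C[i, j] += delta

    # Detection: recompute checksums of the data block and compare
    # against the stored checksum row/column.
    data = C[:-1, :-1]
    col_err = data.sum(axis=0) - C[-1, :-1]
    row_err = data.sum(axis=1) - C[:-1, -1]
    bad_cols = np.flatnonzero(np.abs(col_err) > 1e-8)
    bad_rows = np.flatnonzero(np.abs(row_err) > 1e-8)

    # Location + correction: a single corrupted element shows up as
    # exactly one inconsistent row and one inconsistent column.
    if bad_rows.size == 1 and bad_cols.size == 1:
        i, j = bad_rows[0], bad_cols[0]
        data[i, j] -= row_err[i]                        # undo the flip
    return data
```

A single erroneous element is pinpointed by the intersection of the one inconsistent row checksum and the one inconsistent column checksum, which is what lets ABFT correct it without recomputing the whole product.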