Scientific computing is bound only by the limits of Moore's Law and the scalability of high-performance mathematical library implementations. Most mathematical libraries, however, target only general inputs, limiting their potential performance and scalability by not tailoring their implementations to specific classes of input, such as non-negative matrices. Removing this limitation makes it possible to improve both the performance and the accuracy of a range of problems. In this paper we explore the limitations hardware places on the accuracy of non-negative matrix multiplication, comparing implementations on the GPU and the CPU, and we propose algorithmic solutions to improve accuracy. Next, we demonstrate a matrix multiplication implementation that takes advantage of asymptotically fast algorithms, which have been shown to scale better than O(N^3) implementations, and that improves accuracy by up to a whole decimal digit while increasing performance by up to 27% for matrices with positive entries. Finally, we propose extending the BLAS level 3 specification to non-negative matrices, to allow easy integration of our solution and to let other library authors implement their own solutions within an existing standard.
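The abstract does not spell out the algorithmic solutions themselves. As a hedged illustration of the underlying accuracy problem, the minimal C sketch below contrasts naive single-precision dot-product accumulation with Kahan compensated summation, one standard technique for reducing accumulated rounding error; it is not necessarily the method used in the paper.

```c
#include <stdio.h>

/* Naive left-to-right accumulation in single precision. */
float dot_naive(const float *x, const float *y, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Kahan compensated summation: carries a correction term c so that
 * low-order bits lost by each addition are fed back into the next one. */
float dot_kahan(const float *x, const float *y, int n) {
    float s = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        float t = x[i] * y[i] - c; /* reapply previously lost low-order bits */
        float u = s + t;           /* low-order bits of t may be lost here */
        c = (u - s) - t;           /* recover what this addition dropped */
        s = u;
    }
    return s;
}

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 0.1f; }
    /* With non-negative inputs there is no cancellation, but the rounding
     * error of naive accumulation still grows with n, because each small
     * addend is rounded against an ever-larger running sum. The exact
     * answer here is N * 0.1 (up to the rounding of 0.1f). */
    printf("naive: %.6f\nkahan: %.6f\n",
           dot_naive(x, y, N), dot_kahan(x, y, N));
    return 0;
}
```

Running this sketch shows the naive single-precision sum drifting visibly from N * 0.1 while the compensated sum stays close to it, which is the kind of accumulated error the paper's accuracy comparison between GPU and CPU implementations concerns.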