We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as hybrid Strassen and Winograd fast matrix multiplication, and their combination with the 3M algorithm for complex matrices (hybrid meaning that a recursive algorithm such as Strassen's is applied until a highly tuned BLAS matrix multiplication becomes the faster choice). We investigate how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful and natural exploitation of parallelism at the function level, through optimizations such as function-call parallelism, function percolation, and function software pipelining. We make three contributions. First, we present a performance overview for double- and double-complex-precision matrices on state-of-the-art SMP systems. Second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new schedules of Winograd's matrix multiplication, achieving up to 20% speedup with respect to regular matrix multiplication; one schedule is designed to minimize the number of matrix additions, the other to minimize the computation latency of the matrix additions. Third, we apply software pipelining and thread allocation to all the algorithms, showing how this yields up to 10% further performance improvement.
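The hybrid scheme described above can be sketched as follows: recurse with the classical Winograd variant of Strassen's algorithm (7 block multiplications, 15 block additions) until the blocks fall below a crossover size, then hand off to a tuned BLAS GEMM (here numpy's `@`, standing in for BLAS). This is a minimal illustration, not the paper's tuned implementation: sizes are assumed to be powers of two, and the `CUTOFF` value is a hypothetical placeholder where the paper would use a machine-specific tuned constant.

```python
import numpy as np

CUTOFF = 64  # hypothetical crossover point; in practice tuned per machine


def winograd(A, B):
    """Winograd's variant of Strassen's algorithm; n assumed a power of two."""
    n = A.shape[0]
    if n <= CUTOFF:
        return A @ B  # base case: fall back to the tuned BLAS kernel

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # 8 pre-additions
    S1 = A21 + A22
    S2 = S1 - A11
    S3 = A11 - A21
    S4 = A12 - S2
    S5 = B12 - B11
    S6 = B22 - S5
    S7 = B22 - B12
    S8 = S6 - B21

    # 7 recursive block multiplications
    M1 = winograd(S2, S6)
    M2 = winograd(A11, B11)
    M3 = winograd(A12, B21)
    M4 = winograd(S3, S7)
    M5 = winograd(S1, S5)
    M6 = winograd(S4, B22)
    M7 = winograd(A22, S8)

    # 7 post-additions assembling the result blocks
    T1 = M1 + M2
    T2 = T1 + M4
    C = np.empty_like(A)
    C[:h, :h] = M2 + M3
    C[:h, h:] = T1 + M5 + M6
    C[h:, :h] = T2 - M7
    C[h:, h:] = T2 + M5
    return C
```

The two Winograd schedules discussed in the abstract differ in how operand-level additions like these are ordered and fused, trading the total count of additions against their latency on the critical path; the dependency structure above (the `S`, `M`, and `T` temporaries) is what such a schedule rearranges.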