Adaptive Strassen's matrix multiplication

Authors:
Paolo D'Alberto;Alexandru Nicolau
Affiliations:
Yahoo! Sunnyvale, CA;University of California, Irvine
Venue:
Proceedings of the 21st annual international conference on Supercomputing
Year:
2007

Citing 20
Cited 6

How can we speed up matrix multiplication?

SIAM Review
Matrix multiplication via arithmetic progressions

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion

Proceedings of the 1988 ACM/IEEE conference on Supercomputing
Exploiting fast matrix multiplication within the level 3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Using Strassen's algorithm to accelerate the solution of linear systems

The Journal of Supercomputing
Stability of block algorithms with fast level-3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Implementation of Strassen's algorithm for matrix multiplication

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Accuracy and Stability of Numerical Algorithms

Accuracy and Stability of Numerical Algorithms
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Algorithms for matrix multiplication

Algorithms for matrix multiplication
The aggregation and cancellation techniques as a practical tool for faster matrix multiplication

Theoretical Computer Science - Algebraic and numerical algorithm
Optimizing Sorting with Genetic Algorithms

Proceedings of the international symposium on Code generation and optimization
Adaptive Strassen and ATLAS's DGEMM: A Fast Square-Matrix Multiply for Modern High-Performance Systems

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software (TOMS)
Using recursion to boost ATLAS's performance

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems

Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
Matrices as arrows!: a biproduct approach to typed linear algebra

MPC'10 Proceedings of the 10th international conference on Mathematics of program construction
Optimization techniques for small matrix multiplication

Theoretical Computer Science
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

ACM Transactions on Mathematical Software (TOMS)
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Typing linear algebra: A biproduct-oriented approach

Science of Computer Programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

Strassen's matrix multiplication (MM) has benefits with respect to any (highly tuned) implementations of MM because Strassen's reduces the total number of operations. Strassen achieved this operation reduction by replacing computationally expensive MMs with matrix additions (MAs). For architectures with simple memory hierarchies, having fewer operations directly translates into an efficient utilization of the CPU and, thus, faster execution. However, for modern architectures with complex memory hierarchies, the operations introduced by the MAs have a limited in-cache data reuse and thus poor memory-hierarchy utilization, thereby overshadowing the (improved) CPU utilization, and making Strassen's algorithm (largely) useless on its own. In this paper, we investigate the interaction between Strassen's effective performance and the memory-hierarchy organization. We show how to exploit Strassen's full potential across different architectures. We present an easy-to-use adaptive algorithm that combines a novel implementation of Strassen's idea with the MM from automatically tuned linear algebra software (ATLAS) or GotoBLAS. An additional advantage of our algorithm is that it applies to any size and shape matrices and works equally well with row or column major layout. Our implementation consists of introducing a final step in the ATLAS/GotoBLAS-installation process that estimates whether or not we can achieve any additional speedup using our Strassen's adaptation algorithm. Then we install our codes, validate our estimates, and determine the specific performance. We show that, by the right combination of Strassen's with ATLAS/GotoBLAS, our approach achieves up to 30%/22% speed-up versus ATLAS/GotoBLAS alone on modern high-performance single processors. We consider and present the complexity and the numerical analysis of our algorithm, and, finally, we show performance for 17 (uniprocessor) systems.