Tuning Strassen's matrix multiplication for memory efficiency

Authors:
Mithuna Thottethodi; Siddhartha Chatterjee; Alvin R. Lebeck
Affiliations:
Duke University, Durham, NC; The University of North Carolina, Chapel Hill, NC; Duke University, Durham, NC
Venue:
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Year:
1998

Abstract

Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order, which is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness.

Performance comparisons of our implementation with competing implementations show that ours often outperforms the alternative techniques (by up to 25%). However, we also observe wide variability across platforms and across matrix sizes, indicating that, at this time, no single implementation is a clear choice for all platforms or matrix sizes. We also note that the time required to convert matrices to and from Morton order accounts for a noticeable fraction of execution time (5% to 15%). Eliminating this overhead further reduces our execution time.
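
The two techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `morton_index` and `choose_truncation`, the element-level (rather than tile-level) interleaving, and the tile-size range of 16 to 64 are all assumptions made for illustration. The first function maps a (row, column) pair to its position in a Morton (Z-order) layout by interleaving bits, which places each quadrant of the quad-tree decomposition in a contiguous range. The second searches a range of leaf tile sizes for the one whose padded dimension, tile size times a power of two, exceeds the matrix dimension by the least.

```python
def morton_index(row: int, col: int, bits: int) -> int:
    """Interleave the low `bits` bits of row and col (Z-order).

    Assumption: column bits occupy even positions and row bits odd
    positions, so quadrants are visited as NW, NE, SW, SE.
    """
    idx = 0
    for b in range(bits):
        idx |= ((col >> b) & 1) << (2 * b)
        idx |= ((row >> b) & 1) << (2 * b + 1)
    return idx


def choose_truncation(n: int, t_min: int = 16, t_max: int = 64):
    """Pick a leaf tile size and recursion depth for an n x n matrix.

    Returns (tile, depth, padded) where padded = tile * 2**depth is the
    smallest such value >= n over all tile sizes in [t_min, t_max].
    The range [16, 64] is an illustrative assumption, not a tuned value.
    """
    best = None
    for tile in range(t_min, t_max + 1):
        depth = 0
        while (tile << depth) < n:  # double until the matrix fits
            depth += 1
        padded = tile << depth
        if best is None or padded < best[2]:
            best = (tile, depth, padded)
    return best


if __name__ == "__main__":
    # Morton positions of a 4 x 4 matrix, printed row by row.
    for r in range(4):
        print([morton_index(r, c, bits=2) for c in range(4)])
    # A fixed leaf size of 32 would pad 513 up to 1024; searching the
    # range finds a much smaller padded dimension.
    print(choose_truncation(513))
```

In this sketch, each 2 x 2 quadrant of the 4 x 4 example occupies four consecutive positions, which is the property that keeps recursive submatrices contiguous in memory. For a dimension of 513, the search settles on a tile size of 33 with four levels of recursion, padding to 528 instead of 1024.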