Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

Authors:
Jeremy D. Frens;David S. Wise
Affiliations:
Computer Science Dept., Indiana University, Bloomington, Indiana;Computer Science Dept., Indiana University, Bloomington, Indiana
Venue:
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
1997

Cited by: 54
References (18):

An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Parallel algorithms for dense linear algebra computations

SIAM Review
Exploiting fast matrix multiplication within the level 3 BLAS

ACM Transactions on Mathematical Software (TOMS)
A gentle introduction to Haskell

ACM SIGPLAN Notices - Haskell special issue
Stability of block algorithms with fast level-3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Retire Fortran?: a debate rekindled

Communications of the ACM
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A model and compilation strategy for out-of-core data parallel programs

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Symbolic mathematics system evaluators (extended abstract)

ISSAC '96 Proceedings of the 1996 international symposium on Symbolic and algebraic computation
LogP: a practical model of parallel computation

Communications of the ACM
Matrix computations (3rd ed.)

Matrix computations (3rd ed.)
The art of computer programming, volume 1 (3rd ed.): fundamental algorithms

The art of computer programming, volume 1 (3rd ed.): fundamental algorithms
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Storage reorganization techniques for matrix computation in a paging environment

Communications of the ACM
Organizing matrices and matrix operations for paged memory systems

Communications of the ACM
Debunking the “expensive procedure call” myth or, procedure call implementations considered harmful or, LAMBDA: The Ultimate GOTO

ACM '77 Proceedings of the 1977 annual conference
Representing matrices as quadtrees for parallel processors: extended abstract

ACM SIGSAM Bulletin

Cited by:

Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Automatic parallelization of divide and conquer algorithms

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Symbolic bounds analysis of pointers, array indices, and accessed memory regions

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Exact analysis of the cache behavior of nested loops

Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Efficient Representation Scheme for Multidimensional Array Operations

IEEE Transactions on Computers
Pthreads for dynamic and irregular parallelism

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Global static indexing for real-time exploration of very large regular grids

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Increasing temporal locality with skewing and recursive blocking

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

International Journal of Parallel Programming
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Optimizing Graph Algorithms for Improved Cache Performance

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Recursion Unrolling for Divide and Conquer Programs

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free

Euro-Par '00 Proceedings from the 6th International Euro-Par Conference on Parallel Processing
Fractal Matrix Multiplication: A Case Study on Portability of Cache Performance

WAE '01 Proceedings of the 5th International Workshop on Algorithm Engineering
Design-Driven Compilation

CC '01 Proceedings of the 10th International Conference on Compiler Construction
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Comparing Parallel Functional Languages: Programming and Performance

Higher-Order and Symbolic Computation
On improving the memory access patterns during the execution of Strassen's matrix multiplication algorithm

ACSC '04 Proceedings of the 27th Australasian conference on Computer science - Volume 26
Optimizing Graph Algorithms for Improved Cache Performance

IEEE Transactions on Parallel and Distributed Systems
The Opie compiler from row-major source to Morton-ordered matrices

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion

ACM Transactions on Mathematical Software (TOMS)
Symbolic bounds analysis of pointers, array indices, and accessed memory regions

ACM Transactions on Programming Languages and Systems (TOPLAS)
A fully portable high performance minimal storage hybrid format Cholesky algorithm

ACM Transactions on Mathematical Software (TOMS)
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
A hierarchical model of data locality

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Adaptive Strassen and ATLAS's DGEMM: A Fast Square-Matrix Multiply for Modern High-Performance Systems

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms

Proceedings of the 2006 workshop on Memory system performance and correctness
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Representation-transparent matrix algorithms with scalable performance

Proceedings of the 21st annual international conference on Supercomputing
Adaptive Strassen's matrix multiplication

Proceedings of the 21st annual international conference on Supercomputing
Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
Using non-canonical array layouts in dense matrix operations

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Evaluating ISA support and hardware support for recursive data layouts

HiPC'07 Proceedings of the 14th international conference on High performance computing
Using recursion to boost ATLAS's performance

ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Cache-oblivious polygon indecomposability testing

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation
Oracle scheduling: controlling granularity in implicitly parallel languages

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

ACM Transactions on Mathematical Software (TOMS)
Cache-Oblivious Algorithms

ACM Transactions on Algorithms (TALG)
Optimizing matrix multiplication with a classifier learning system

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
A data locality methodology for matrix---matrix multiplication algorithm

The Journal of Supercomputing
A study on load imbalance in parallel hypermatrix multiplication using OpenMP

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
A cache oblivious algorithm for matrix multiplication based on peano's space filling curve

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Adapting linear algebra codes to the memory hierarchy using a hypermatrix scheme

PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
JuliusC: a practical approach for the analysis of divide-and-conquer algorithms

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Implementing a numerical solution of the KPI equation using single assignment c: lessons and experiences

IFL'05 Proceedings of the 17th international conference on Implementation and Application of Functional Languages
Communication efficient gaussian elimination with partial pivoting using a shape morphing data layout

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures

Abstract

An elementary, machine-independent, recursive algorithm for matrix multiplication C += A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. Proof of concept is demonstrated by racing the in-place algorithm against the manufacturer's hand-tuned BLAS3 routines; it can win.

The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers, which merrily unroll loops for the superscalar R8000 processor but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter future programmers from using this rich class of recursive algorithms.
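
The abstract does not spell out the storage layout or the code, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes square matrices whose order is a power of two, stored in Morton (Z) order so that every quadrant occupies a contiguous slice of a flat list; the names `morton_index`, `to_morton`, `from_morton`, and `mm_add` are hypothetical. The recursion performs C += A*B quadrant by quadrant, and each of the four top-level quadrants of C is written through its own disjoint, contiguous range of offsets.

```python
def morton_index(i, j, n):
    """Offset of element (i, j) in an n-by-n Morton-ordered array (n a power of 2)."""
    idx = 0
    bit = n >> 1
    while bit:
        # Two bits per level: row bit is the high bit, column bit the low bit,
        # which orders the quadrants NW, NE, SW, SE.
        idx = (idx << 2) | (2 if i & bit else 0) | (1 if j & bit else 0)
        bit >>= 1
    return idx


def to_morton(rows):
    """Copy a list-of-rows matrix into a flat Morton-ordered list."""
    n = len(rows)
    flat = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[morton_index(i, j, n)] = rows[i][j]
    return flat


def from_morton(flat, n):
    """Copy a flat Morton-ordered list back into a list of rows."""
    return [[flat[morton_index(i, j, n)] for j in range(n)] for i in range(n)]


def mm_add(C, c, A, a, B, b, n):
    """In place: block of C at offset c += (block of A at a) * (block of B at b).

    Each block is n-by-n and Morton-ordered, so it occupies n*n consecutive
    slots starting at its offset.
    """
    if n == 1:
        # Base case is a single scalar here; a tuned version would stop at a
        # larger block and unfold it (an assumption, not shown).
        C[c] += A[a] * B[b]
        return
    h = n // 2
    q = h * h  # number of elements in one quadrant
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                # C_ij += A_ik * B_kj, where quadrant (r, s) sits at offset (2*r + s) * q
                mm_add(C, c + (2 * i + j) * q,
                       A, a + (2 * i + k) * q,
                       B, b + (2 * k + j) * q, h)


A = to_morton([[1, 2], [3, 4]])
B = to_morton([[5, 6], [7, 8]])
C = [0] * 4
mm_add(C, 0, A, 0, B, 0, 2)
print(from_morton(C, 2))  # [[19, 22], [43, 50]]
```

Nothing in the sketch refers to a cache or page size: the same quadrant recursion yields blocks of every size on the way down, which is the sense in which the blocking is implicit. The two `k` iterations for a given quadrant of C must run in sequence, while the four (i, j) quadrants are independent of one another.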