Anatomy of high-performance matrix multiplication

Authors:
Kazushige Goto;Robert A. van de Geijn
Affiliations:
The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2008

Citing 14
Cited 65

A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
A Note On Parallel Matrix Inversion

SIAM Journal on Scientific Computing
A Family of High-Performance Matrix Multiplication Algorithms

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
The science of deriving dense linear algebra algorithms

ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces

ACM Transactions on Mathematical Software (TOMS)
Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix

ACM Transactions on Mathematical Software (TOMS)
A family of high-performance matrix multiplication algorithms

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing

Adaptive Strassen's matrix multiplication

Proceedings of the 21st annual international conference on Supercomputing
Scalable parallelization of FLAME code via the workqueuing model

ACM Transactions on Mathematical Software (TOMS)
High performance dense linear algebra on a spatially distributed processor

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Cray XT4: an early evaluation for petascale scientific simulation

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms

Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Combining building blocks for parallel multi-level matrix multiplication

Parallel Computing
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix

ACM Transactions on Mathematical Software (TOMS)
High-performance implementation of the level-3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Updating an LU Factorization with Pivoting

ACM Transactions on Mathematical Software (TOMS)
BTL++: From Performance Assessment to Optimal Libraries

ICCS '08 Proceedings of the 8th international conference on Computational Science, Part III
A unified model for multicore architectures

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
Distributed SBP Cholesky factorization algorithms with near-optimal scheduling

ACM Transactions on Mathematical Software (TOMS)
Petascale computing with accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming the Linpack benchmark for the IBM PowerXCell 8i processor

Scientific Programming - High Performance Computing with the Cell Broadband Engine
An evaluation of the TRIPS computer system

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Programming matrix algorithms-by-blocks for thread-level parallelism

ACM Transactions on Mathematical Software (TOMS)
Towards many-core implementation of LU decomposition using Peano Curves

Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
C++ Bindings to External Software Libraries with Examples from BLAS, LAPACK, UMFPACK, and MUMPS

ACM Transactions on Mathematical Software (TOMS)
Cache-optimal algorithms for option pricing

ACM Transactions on Mathematical Software (TOMS)
Minimizing communication in sparse matrix solvers

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Automating the generation of composed linear algebra kernels

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Evaluating multicore algorithms on the unified memory model

Scientific Programming - Software Development for Multi-core Computing Systems
Scaling LAPACK panel operations using parallel cache assignment

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training

The Journal of Machine Learning Research
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Fine tuning matrix multiplications on multicore

HiPC'08 Proceedings of the 15th international conference on High performance computing
Overlapping communication and computation by using a hybrid MPI/SMPSs approach

Proceedings of the 24th ACM International Conference on Supercomputing
Managing the complexity of lookahead for LU factorization with pivoting

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
A comparison of high-order time integrators for thermal convection in rotating spherical shells

Journal of Computational Physics
Programming the Linpack benchmark for Roadrunner

IBM Journal of Research and Development
The evaluation of American options in a stochastic volatility model with jumps: An efficient finite element approach

Computers & Mathematics with Applications
Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Reducing the worst case running times of a family of RNA and CFG problems, using Valiant's approach

WABI'10 Proceedings of the 10th international conference on Algorithms in bioinformatics
Matrices as arrows!: a biproduct approach to typed linear algebra

MPC'10 Proceedings of the 10th international conference on Mathematics of program construction
PARFES: A method for solving finite element linear equations on multi-core computers

Advances in Engineering Software
Experiences using hybrid MPI/OpenMP in the real world: Parallelization of a 3D CFD solver for multi-core node clusters

Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Using simulation to design extremescale applications and architectures: programming model exploration

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
A fast GEMM implementation on the cypress GPU

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

ACM Transactions on Mathematical Software (TOMS)
Fast implementation of DGEMM on Fermi GPU

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Exploring thread and memory placement on NUMA architectures: solaris and linux, UltraSPARC/FirePlane and opteron/hypertransport

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture

Parallel Computing
Upper and lower I/O bounds for pebbling r-pyramids

Journal of Discrete Algorithms
Column-Based matrix partitioning for parallel matrix multiplication on heterogeneous processors based on functional performance models

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Proceedings of the 26th ACM international conference on Supercomputing
Performance characterization of global address space applications: a case study with NWChem

Concurrency and Computation: Practice & Experience
The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations

Journal of Parallel and Distributed Computing
SAR image reconstruction and autofocus by compressed sensing

Digital Signal Processing
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Toward scalable matrix multiply on multithreaded architectures

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
High-Performance matrix multiply on a massively multithreaded fiteng1000 processor

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Elemental: A New Framework for Distributed Memory Dense Matrix Computations

ACM Transactions on Mathematical Software (TOMS)
Multi-core scalability measurements: issues and solutions

PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Scaling LAPACK panel operations using parallel cache assignment

ACM Transactions on Mathematical Software (TOMS)
Exploiting vector instructions with generalized stream fusio

Proceedings of the 18th ACM SIGPLAN international conference on Functional programming
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Typing linear algebra: A biproduct-oriented approach

Science of Computer Programming
Interfaces are key

SE-HPCCSE '13 Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
A Basic Linear Algebra Compiler

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
In-place transposition of rectangular matrices on accelerators

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Building software environments for research computing clusters

LISA'13 Proceedings of the 27th international conference on Large Installation System Administration
High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

The Journal of Supercomputing

Quantified Score

Hi-index	0.01

Visualization

Abstract

We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.