High-performance implementation of the level-3 BLAS

Authors:
Kazushige Goto;Robert van de Geijn
Affiliations:
The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2008

Citing 6

A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software (TOMS)
Toward scalable matrix multiply on multithreaded architectures

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing

Cited 28

Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate

ACM Transactions on Mathematical Software (TOMS)
New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Attaining High Performance in General-Purpose Computations on Current Graphics Processors

High Performance Computing for Computational Science - VECPAR 2008
Block Kalman Filtering for Large-Scale DSGE Models

Computational Economics
Large-scale deep unsupervised learning using graphics processors

ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
C++ Bindings to External Software Libraries with Examples from BLAS, LAPACK, UMFPACK, and MUMPS

ACM Transactions on Mathematical Software (TOMS)
Biomedical Case Studies in Data Intensive Computing

CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing
A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems

ACM Transactions on Mathematical Software (TOMS)
Spatial relationship preserving character motion adaptation

ACM SIGGRAPH 2010 papers
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Fine tuning matrix multiplications on multicore

HiPC'08 Proceedings of the 15th international conference on High performance computing
Bundle adjustment in the large

ECCV'10 Proceedings of the 11th European conference on Computer vision: Part II
High-performance reconfigurable hardware architecture for restricted Boltzmann machines

IEEE Transactions on Neural Networks
Performance models for the Spike banded linear system solver

Scientific Programming
Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization

ACM Transactions on Mathematical Software (TOMS)
Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications

Numerical Algorithms
Fast algorithms for floating-point interval matrix multiplication

Journal of Computational and Applied Mathematics
Fast static analysis of power grids: algorithms and implementations

Proceedings of the International Conference on Computer-Aided Design
Analytical bounds for optimal tile size selection

CC'12 Proceedings of the 21st international conference on Compiler Construction
Runtime detection and optimization of collective communication patterns

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Multi-core scalability measurements: issues and solutions

PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Interactive partner control in close interactions for real-time applications

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Exploiting vector instructions with generalized stream fusion

Proceedings of the 18th ACM SIGPLAN international conference on Functional programming
Harmonic parameterization by electrostatics

ACM Transactions on Graphics (TOG)
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Journal of Parallel and Distributed Computing
Tile size selection revisited

ACM Transactions on Architecture and Code Optimization (TACO)

Abstract

A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.
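
As a rough illustration of the idea in the abstract, the sketch below computes a symmetric matrix product (the SYMM operation, C = A B with A symmetric) using nothing but a general matrix-multiply kernel. The only SYMM-specific step is a packing routine that copies each block of A into a dense buffer while reading just the lower triangle. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`gemm_kernel`, `pack_symmetric_block`, `symm_via_gemm`), the block size, and the pure-Python list-of-lists layout are all hypothetical and chosen for readability rather than speed.

```python
def gemm_kernel(a_blk, b_blk, c_blk):
    """Stand-in for a tuned GEMM kernel: c_blk += a_blk * b_blk (matrix product)."""
    for i, a_row in enumerate(a_blk):
        c_row = c_blk[i]
        for p, a_ip in enumerate(a_row):
            b_row = b_blk[p]
            for j, b_pj in enumerate(b_row):
                c_row[j] += a_ip * b_pj


def pack_symmetric_block(a_lower, rows, cols):
    """Copy the block A[rows, cols] into a dense buffer.

    Only entries on or below the diagonal of a_lower are read; an entry
    above the diagonal is fetched from its mirror position instead.
    """
    return [
        [a_lower[i][j] if i >= j else a_lower[j][i] for j in cols]
        for i in rows
    ]


def symm_via_gemm(a_lower, b, block=2):
    """Return C = A * B for symmetric A stored in its lower triangle.

    All arithmetic happens inside gemm_kernel; symmetry is handled
    entirely by the packing step (an assumption of this sketch).
    """
    n = len(a_lower)
    m = len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        rows = range(i0, min(i0 + block, n))
        c_blk = [c[i] for i in rows]  # row references, so updates land in c
        for p0 in range(0, n, block):
            cols = range(p0, min(p0 + block, n))
            a_blk = pack_symmetric_block(a_lower, rows, cols)
            b_blk = [b[p] for p in cols]
            gemm_kernel(a_blk, b_blk, c_blk)
    return c


# Usage: the upper triangle holds None to show that it is never read.
a_lower = [[2, None, None],
           [1, 3, None],
           [4, 5, 6]]
b = [[1, 0],
     [0, 1],
     [1, 1]]
c = symm_via_gemm(a_lower, b)
```

In this sketch the kernel never learns that A is symmetric, which is the point of the design: any speed-up made to the general kernel would carry over to the symmetric operation without further changes.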