The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time-consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library that relies mainly on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
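To illustrate the partitioning idea, here is a minimal sketch in Python (not the Fortran 77 model implementations described above) of a triangular solve with multiple right-hand sides, a TRSM-like operation, built from a GEMM-style update. The function names `gemm_update`, `solve_diag_block`, and `trsm_gemm_based`, as well as the block size, are hypothetical choices made for this example. Off-diagonal block updates go through the GEMM-style routine, and only the small diagonal blocks are handled by a level 2 style forward substitution.

```python
def gemm_update(A, X, C, rows, inner, ncols):
    """C[rows, :] <- C[rows, :] - A[rows, inner] * X[inner, :].

    This plays the role of a GEMM call with alpha = -1 and beta = 1,
    restricted to index ranges instead of separate submatrix copies.
    """
    for i in rows:
        for j in range(ncols):
            s = 0.0
            for p in inner:
                s += A[i][p] * X[p][j]
            C[i][j] -= s


def solve_diag_block(L, B, rows, ncols):
    """Forward substitution on one diagonal block of L (level 2 style work)."""
    for j in range(ncols):
        for i in rows:
            s = B[i][j]
            for p in range(rows.start, i):
                s -= L[i][p] * B[p][j]
            B[i][j] = s / L[i][i]


def trsm_gemm_based(L, B, block=2):
    """Solve L * X = B for X, with L lower triangular; B is overwritten by X.

    Assumption for this sketch: L is a square list of rows with a nonzero
    diagonal, and B has the same number of rows as L.
    """
    n = len(L)
    ncols = len(B[0])
    for start in range(0, n, block):
        rows = range(start, min(start + block, n))
        if start > 0:
            # Subtract the contribution of the already solved rows 0..start-1.
            # This is the bulk of the arithmetic and is a pure GEMM operation.
            gemm_update(L, B, B, rows, range(0, start), ncols)
        # Only the small diagonal block needs a non-GEMM computation.
        solve_diag_block(L, B, rows, ncols)
    return B


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [4.0, 5.0, 6.0]]
    B = [[2.0, 4.0],
         [10.0, 14.0],
         [49.0, 64.0]]
    print(trsm_gemm_based(L, B, block=2))
```

As the block size grows relative to the diagonal blocks' share of the work, most floating-point operations fall inside `gemm_update`, which is why a well-tuned GEMM can carry the performance of the whole routine in this kind of design.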