A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
LAPACK: a portable linear algebra library for high-performance computers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
MOB forms: a class of multilevel block algorithms for dense linear algebra operations
ICS '94 Proceedings of the 8th international conference on Supercomputing
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
DXML: a high-performance scientific subroutine library
Digital Technical Journal
Data prefetching and multilevel blocking for linear algebra operations
ICS '96 Proceedings of the 10th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
ACM Transactions on Mathematical Software (TOMS)
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
A recursive formulation of Cholesky factorization of a matrix in packed storage
ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
A Family of High-Performance Matrix Multiplication Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms
PPAM '01 Proceedings of the th International Conference on Parallel Processing and Applied Mathematics-Revised Papers
VECPAR '96 Selected papers from the Second International Conference on Vector and Parallel Processing
Adapting linear algebra codes to the memory hierarchy using a hypermatrix scheme
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Using non-canonical array layouts in dense matrix operations
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
New data structures for matrices and specialized inner kernels: low overhead for high performance
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
New level-3 BLAS kernels for cholesky factorization
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms
ACM Transactions on Mathematical Software (TOMS)
Hi-index | 0.01 |
The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce high performance sparse and dense linear algebra codes on a variety of platforms.