A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
IBM Journal of Research and Development
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms
PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Formal Methods for High-Performance Linear Algebra Libraries
Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
BLAS Based on Block Data Structures
A Flexible Class of Parallel Matrix Multiplication Algorithms
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Formal derivation of algorithms: The triangular sylvester equation
ACM Transactions on Mathematical Software (TOMS)
Architecture of an automatically tuned linear algebra library
Parallel Computing
High-performance linear algebra algorithms using new generalized data structures for matrices
IBM Journal of Research and Development
The science of deriving dense linear algebra algorithms
ACM Transactions on Mathematical Software (TOMS)
High performance dense linear algebra on a spatially distributed processor
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix
ACM Transactions on Mathematical Software (TOMS)
Updating an LU Factorization with Pivoting
ACM Transactions on Mathematical Software (TOMS)
ICCSA'03 Proceedings of the 2003 international conference on Computational science and its applications: PartI
Remote parallel model reduction of linear time-invariant systems made easy
VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
New data structures for matrices and specialized inner kernels: low overhead for high performance
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
A family of high-performance matrix multiplication algorithms
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Performance of linear algebra code: Intel Xeon EM64T and ItaniumII case examples
ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part IV
SAR image reconstruction and autofocus by compressed sensing
Digital Signal Processing
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
High-Performance matrix multiply on a massively multithreaded fiteng1000 processor
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Over the last half-decade, a number of research efforts have centered on developing software that generates automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software end-products of both projects use brute-force search over a parameter space to find blockings that accommodate multiple levels of the memory hierarchy. We take a different approach: using a simple model of hierarchical memories, we determine mathematically a locally optimal strategy for blocking matrices. The theoretical results show that different strategies are locally optimal depending on the shapes of the matrices involved. They also show that, ideally, the blocking strategy should not be fixed at library generation time; instead, one should pursue a heuristic that determines it dynamically at run time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner kernel for a small matrix multiplication, the approach yields performance superior to that of methods that automatically tune such kernels. Preliminary results for the Intel Pentium III processor support the theoretical insights.
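To make the idea of a shape-dependent, run-time blocking choice concrete, the following is a minimal sketch only. The abstract does not state the heuristic or the memory model, so everything here is an assumption made for illustration: the names `choose_strategy`, `inner_kernel`, and `matmul`, the `block` parameter, and the rule "split the largest dimension in half" are all hypothetical and are not taken from the paper. The sketch shows only the overall structure: a decision made per call from the operand shapes, with all arithmetic delegated to a small inner kernel.

```python
def inner_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Stand-in for a tuned small-matrix kernel (hypothetical).

    Computes C[i0:i1, j0:j1] += A[i0:i1, p0:p1] * B[p0:p1, j0:j1].
    """
    for i in range(i0, i1):
        for p in range(p0, p1):
            a = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a * B[p][j]


def choose_strategy(m, n, k):
    """Assumed heuristic: partition along the largest dimension.

    This is a placeholder for a model-derived rule; it only shows that
    the decision depends on the operand shapes at run time.
    """
    largest = max(m, n, k)
    if largest == m:
        return "m"
    if largest == n:
        return "n"
    return "k"


def _gemm(A, B, C, i0, i1, j0, j1, p0, p1, block):
    m, n, k = i1 - i0, j1 - j0, p1 - p0
    # Small enough in every dimension: hand off to the inner kernel.
    if m <= block and n <= block and k <= block:
        inner_kernel(A, B, C, i0, i1, j0, j1, p0, p1)
        return
    strategy = choose_strategy(m, n, k)
    if strategy == "m":  # split the rows of A and C
        mid = i0 + m // 2
        _gemm(A, B, C, i0, mid, j0, j1, p0, p1, block)
        _gemm(A, B, C, mid, i1, j0, j1, p0, p1, block)
    elif strategy == "n":  # split the columns of B and C
        mid = j0 + n // 2
        _gemm(A, B, C, i0, i1, j0, mid, p0, p1, block)
        _gemm(A, B, C, i0, i1, mid, j1, p0, p1, block)
    else:  # split the shared dimension; both halves update the same C block
        mid = p0 + k // 2
        _gemm(A, B, C, i0, i1, j0, j1, p0, mid, block)
        _gemm(A, B, C, i0, i1, j0, j1, mid, p1, block)


def matmul(A, B, block=4):
    """Return A * B for list-of-lists matrices using the sketch above."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    _gemm(A, B, C, 0, m, 0, n, 0, k, block)
    return C
```

With these assumptions, a tall-and-thin product is partitioned along its rows first, while a product with a long shared dimension is partitioned along that dimension first, so the same entry point behaves differently for differently shaped operands. A real implementation would replace both the heuristic and the kernel with ones derived from the target memory hierarchy.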