ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS).
The cache performance and optimizations of blocked algorithms. ASPLOS IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.
LAPACK: a portable linear algebra library for high-performance computers. Proceedings of the 1990 ACM/IEEE Conference on Supercomputing.
Advanced Compiler Design and Implementation.
Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software (TOMS).
Will C++ Be Faster than Fortran? ISCOPE '97: Proceedings of the Scientific Computing in Object-Oriented Parallel Environments.
The Role of Abstraction in High-Performance Computing. ISCOPE '97: Proceedings of the Scientific Computing in Object-Oriented Parallel Environments.
Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology.
Automatically Tuned Linear Algebra Software.
A Generic C++ Framework for Parallel Mesh-Based Scientific Applications. HIPS '01: Proceedings of the 6th International Workshop on High-Level Parallel Programming Models and Supportive Environments.
Concept-Based Component Libraries and Optimizing Compilers. IPDPS '02: Proceedings of the 16th International Parallel and Distributed Processing Symposium.
On Materializations of Array-Valued Temporaries. LCPC '00: Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing, Revised Papers.
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW. SAIG '00: Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation.
Concept Use or Concept Refinement: An Important Distinction in Building Generic Specifications. ICFEM '02: Proceedings of the 4th International Conference on Formal Engineering Methods: Formal Methods and Software Engineering.
Delayed Evaluation, Self-optimising Software Components as a Programming Model. Euro-Par '02: Proceedings of the 8th International Euro-Par Conference on Parallel Processing.
User-Extensible Simplification: Type-Based Optimizer Generators. CC '01: Proceedings of the 10th International Conference on Compiler Construction.
An Environment for Building Customizable Software Components. CD '02: Proceedings of the IFIP/ACM Working Conference on Component Deployment.
A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Software: Practice & Experience, Focus on Selected PhD Literature Reviews in the Practical Aspects of Software Technology.
Algorithm engineering: bridging the gap between algorithm theory and practice.
Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation.
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation. Science of Computer Programming.
Efficient run-time dispatching in generic programming with minimal code bloat. Science of Computer Programming.
We present a unified approach for building high-performance numerical linear algebra routines for large classes of dense and sparse matrices. As in the Standard Template Library [1], we separate algorithms from data structures using generic programming techniques. Far from hindering high performance, this separation enables portable high-performance code, because the performance-critical code can be isolated from the algorithms and data structures. We also address the performance-portability problem for architecture-dependent algorithms such as matrix-matrix multiply. Recently, code generation systems such as PHiPAC [2] and ATLAS [3] have allowed such algorithms to be tuned to particular architectures. Our approach instead uses template metaprograms [4] to directly express the performance-critical, architecture-dependent sections of code.