Minimizing development and maintenance costs in supporting persistently optimized BLAS

Authors:
R. Clint Whaley;Antoine Petitet
Affiliations:
Computer Science Department, Florida State University, 167 Love Building, Tallahassee, FL 32306-4530, U.S.A.;SUN Microsystems, 42, Avenue d'Iena, 75016 Paris, France
Venue:
Software—Practice & Experience - Research Articles
Year:
2005

Cited 37

High performance BLAS formulation of the multipole-to-local operator in the fast multipole method
Journal of Computational Physics

Adaptive Loop Tiling for a Multi-cluster CMP
ICA3PP '08 Proceedings of the 8th international conference on Algorithms and Architectures for Parallel Processing

BTL++: From Performance Assessment to Optimal Libraries
ICCS '08 Proceedings of the 8th international conference on Computational Science, Part III

A Tool for Optimizing Runtime Parameters of Open MPI
Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

Adaptive Winograd's matrix multiplications
ACM Transactions on Mathematical Software (TOMS)

Automated transformation for performance-critical kernels
LCSD '07 Proceedings of the 2007 Symposium on Library-Centric Software Design

PetaBricks: a language and compiler for algorithmic choice
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation

Bandit-based optimization on graphs with application to library performance tuning
ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning

Adaptive Application Composition in Quantum Chemistry
QoSA '09 Proceedings of the 5th International Conference on the Quality of Software Architectures: Architectures for Adaptive Software Systems

Parallel expression template for large vectors
Proceedings of the 8th workshop on Parallel/High-Performance Object-Oriented Scientific Computing

Optimal block-tridiagonalization of matrices for coherent charge transport
Journal of Computational Physics

Autotuning multigrid with PetaBricks
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

MatForce: supporting rapid algorithm development by automated translation of MatLab prototypes into C++
SE '08 Proceedings of the IASTED International Conference on Software Engineering

Automatic creation of tile size selection models
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization

Sparse matrix algebra for quantum modeling of large systems
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing

Multiparadigm programming in object-oriented languages: current research report on the workshop MPOOL'07 at ECOOP 2007
ECOOP'07 Proceedings of the 2007 conference on Object-oriented technology

Calculating near-singular eigenvalues of the neutron transport operator with arbitrary order anisotropic scattering
Mathematics and Computers in Simulation

Measuring execution times of collective communications in an empirical optimization framework
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface

Exact solutions to linear systems of equations using output sensitive lifting
ACM Communications in Computer Algebra

Co-synthesis of FPGA-based application-specific floating point simd accelerators
Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays

Blendenpik: Supercharging LAPACK's Least-Squares Solver
SIAM Journal on Scientific Computing

Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation
ACM Transactions on Mathematical Software (TOMS)

Two implementations of the preconditioned conjugate gradient method on heterogeneous computing grids
International Journal of Applied Mathematics and Computer Science - Computational Intelligence in Modern Control Systems

Array-Structured object types for mathematical programming
JMLC'06 Proceedings of the 7th joint conference on Modular Programming Languages

Shifting the stage: Staging with delimited control
Journal of Functional Programming

Efficient implementation of interval matrix multiplication
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2

High performance BLAS formulation of the adaptive Fast Multipole Method
Mathematical and Computer Modelling: An International Journal

Explicitly heterogeneous metaprogramming with MetaHaskell
Proceedings of the 17th ACM SIGPLAN international conference on Functional programming

High-performance dynamic quantum clustering on graphics processors
Journal of Computational Physics

A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers

Polyhedral parallel code generation for CUDA
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers

Optimizing performance of automatic training phase for application performance prediction in the grid
HPCC'07 Proceedings of the Third international conference on High Performance Computing and Communications

Automated Comparison of State-Based Software Models in Terms of Their Language and Structure
ACM Transactions on Software Engineering and Methodology (TOSEM)

Terra: a multi-stage language for high-performance computing
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation

Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)

Scalable multimedia content analysis on parallel platforms using python
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)

Abstract

The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations today), supports a large variety of performance-critical routines. Copyright © 2004 John Wiley & Sons, Ltd.
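
The general idea that a Level 3 BLAS routine can be cast in terms of matrix multiply can be illustrated with a small sketch. The Python below is not ATLAS code and does not reproduce the paper's kernel design; it is a minimal illustration, under our own assumptions, of a symmetric rank-k update (the operation BLAS calls SYRK) built entirely from calls to one matrix multiply routine. The names `gemm`, `transpose`, and `syrk_lower_via_gemm`, the block size `nb`, and the list-of-lists storage are all hypothetical choices made for readability.

```python
def gemm(A, B, C, alpha=1.0, beta=1.0):
    """Reference kernel: C <- alpha * (A @ B) + beta * C, on lists of lists.

    Stands in for the single optimized matrix multiply routine;
    a real library would tune this one routine heavily.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C


def transpose(M):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*M)]


def syrk_lower_via_gemm(A, C, nb=2):
    """C <- A @ A^T + C, touching only the lower triangle of C.

    The update is split into nb x nb blocks and each block is
    computed by one gemm call, so all arithmetic runs in gemm.
    Assumes A has at least one column.
    """
    n = len(A)
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        Ai = A[i0:i1]
        # Only blocks on or below the diagonal are visited.
        for j0 in range(0, i0 + 1, nb):
            j1 = min(j0 + nb, n)
            AjT = transpose(A[j0:j1])
            # Copy the block out, update it with gemm, copy it back.
            blk = [row[j0:j1] for row in C[i0:i1]]
            gemm(Ai, AjT, blk)
            for r, i in enumerate(range(i0, i1)):
                for c, j in enumerate(range(j0, j1)):
                    if j <= i:  # leave the strict upper triangle untouched
                        C[i][j] = blk[r][c]
    return C


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = [[0.0] * 3 for _ in range(3)]
syrk_lower_via_gemm(A, C, nb=2)
```

In this sketch the diagonal blocks are computed in full and the upper half is discarded on write-back, which wastes some arithmetic but keeps every floating-point operation inside `gemm`. That trade-off, a little redundant work in exchange for maintaining only one tuned routine, is the kind of maintenance argument the abstract makes, though the paper's actual techniques may differ.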