The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance-critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations), supports a large variety of performance-critical routines. Copyright © 2004 John Wiley & Sons, Ltd.
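To illustrate the general idea of building a Level 3 operation on top of a simple matrix multiply kernel, the following is a minimal sketch in pure Python. It is not ATLAS code: the names `gemm_kernel`, `block`, `transpose`, and `syrk_lower`, the block size `nb`, and the list-of-rows matrix layout are all assumptions made for illustration. The sketch computes the lower triangle of a symmetric rank-k update, C = A·Aᵀ, by calling one small block kernel on each tile.

```python
def gemm_kernel(a, b, c):
    """Hypothetical block kernel: c += a @ b for small dense blocks (lists of rows)."""
    for i, a_row in enumerate(a):
        c_row = c[i]
        for p, a_ip in enumerate(a_row):
            b_row = b[p]
            for j, b_pj in enumerate(b_row):
                c_row[j] += a_ip * b_pj


def block(m, r0, r1, c0, c1):
    """Copy the sub-block m[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in m[r0:r1]]


def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def syrk_lower(a, nb=2):
    """Lower triangle of a @ a^T, built only from calls to gemm_kernel.

    Entries above the diagonal are left at zero. nb is an illustrative
    block size, not a tuned value.
    """
    n, k = len(a), len(a[0])
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        a_i = block(a, i0, i1, 0, k)
        # Only tiles on or below the block diagonal are computed.
        for j0 in range(0, i0 + 1, nb):
            j1 = min(j0 + nb, n)
            a_j_t = transpose(block(a, j0, j1, 0, k))
            tile = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            gemm_kernel(a_i, a_j_t, tile)
            # Copy the tile back, skipping entries above the diagonal.
            for di, row in enumerate(tile):
                for dj, value in enumerate(row):
                    if j0 + dj <= i0 + di:
                        c[i0 + di][j0 + dj] = value
    return c


if __name__ == "__main__":
    for row in syrk_lower([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]):
        print(row)
```

In this sketch all floating-point work happens inside `gemm_kernel`, so only that one routine would need tuning; the surrounding loops handle blocking and the triangular structure. A production library would avoid the block copies and the wasted work on diagonal tiles.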