Automatically tuned linear algebra software

Authors:
R. Clint Whaley; Jack J. Dongarra
Affiliations:
Dept. of Computer Sciences, Univ. of TN, Knoxville, TN; Dept. of Computer Sciences, Univ. of TN, Knoxville, TN, and Mathematical Sciences Section, ORNL, Oak Ridge, TN
Venue:
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Year:
1998

Abstract

This paper describes an approach to automatically generating and optimizing numerical software for processors with deep memory hierarchies and pipelined functional units. Producing such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming process, and the work described here can help automate much of it. We concentrate on the widely used linear algebra kernels known as the Basic Linear Algebra Subprograms (BLAS). In particular, the work presented here targets general matrix multiply, DGEMM. However, much of the technology and approach developed here can be applied to the other Level 3 BLAS, and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations.
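
The abstract does not spell out how the tuning is automated, so the following is only a minimal sketch of one plausible reading: time a cache-blocked matrix multiply for several candidate block sizes and keep the fastest. The function names (blocked_matmul, tune_block_size), the candidate list, and the pure-Python kernel are all hypothetical illustrations and are not taken from the paper; a real tuned DGEMM would generate and time compiled kernels instead.

```python
import time


def blocked_matmul(a, b, n, nb):
    """Return C = A * B for n x n lists of lists, computed in nb x nb blocks.

    Hypothetical illustration kernel; not the paper's generated code.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        i_end = min(ii + nb, n)
        for kk in range(0, n, nb):
            k_end = min(kk + nb, n)
            for jj in range(0, n, nb):
                j_end = min(jj + nb, n)
                # Multiply one block of A by one block of B, accumulating into C.
                for i in range(ii, i_end):
                    a_row, c_row = a[i], c[i]
                    for k in range(kk, k_end):
                        aik = a_row[k]
                        b_row = b[k]
                        for j in range(jj, j_end):
                            c_row[j] += aik * b_row[j]
    return c


def tune_block_size(n, candidates, repeats=3):
    """Time each candidate block size on this machine and return the fastest.

    Returns (best_block_size, {block_size: best_seconds}).
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best  # keep the least-disturbed run for this candidate
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_nb, timings = tune_block_size(48, [4, 8, 16, 48])
    print("selected block size:", best_nb)
```

The selected block size depends on the machine and on timing noise, which is why the choice is measured rather than fixed in advance. In an interpreted language the differences between candidates are small, so the sketch shows the search structure rather than realistic speedups.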