Automating the generation of composed linear algebra kernels

Authors:
Geoffrey Belter;E. R. Jessup;Ian Karlin;Jeremy G. Siek
Affiliations:
University of Colorado;University of Colorado;University of Colorado;University of Colorado
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Citing 36
Cited 6

A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
LAPACK: a portable linear algebra library for high-performance computers

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Compiler transformations for high-performance computing

ACM Computing Surveys (CSUR)
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Maximizing parallelism and minimizing synchronization with affine transforms

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The automatic generation of sparse primitives

ACM Transactions on Mathematical Software (TOMS)
High-level semantic optimization of numerical codes

ICS '99 Proceedings of the 13th international conference on Supercomputing
Cache miss equations: a compiler framework for analyzing and tuning memory behavior

ACM Transactions on Programming Languages and Systems (TOPLAS)
An APL Compiler for a Vector Processor

ACM Transactions on Programming Languages and Systems (TOPLAS)
Tiling optimizations for 3D scientific computations

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
A framework for sparse matrix code synthesis from high-level specifications

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
PLAPACK: parallel linear algebra package design overview

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Dependence graphs and compiler optimizations

POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
An updated set of basic linear algebra subprograms (BLAS)

ACM Transactions on Mathematical Software (TOMS)
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
FALCON: A MATLAB Interactive Restructuring Compiler

LCPC '95 Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Tiling for Memory Hierarchies

Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Computer Architecture: A Quantitative Approach

Computer Architecture: A Quantitative Approach
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Proceedings of the international symposium on Code generation and optimization
Tuning High Performance Kernels through Empirical Compilation

ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms

International Journal of High Performance Computing Applications
Think globally, search locally

Proceedings of the 19th annual international conference on Supercomputing
On Improving Linear Solver Performance: A Block Variant of GMRES

SIAM Journal on Scientific Computing
Using Machine Learning to Focus Iterative Optimization

Proceedings of the International Symposium on Code Generation and Optimization
Programming for parallelism and locality with hierarchically tiled arrays

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
The rise and fall of High Performance Fortran: an historical object lesson

Proceedings of the third ACM SIGPLAN conference on History of programming languages
Mechanical derivation and systematic analysis of correct linear algebra algorithms

Mechanical derivation and systematic analysis of correct linear algebra algorithms
Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software (TOMS)
Cache efficient bidiagonalization using BLAS 2.5 operators

ACM Transactions on Mathematical Software (TOMS)
Iterative optimization in the polyhedral model: part ii, multidimensional time

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Automatic tuning of scientific applications

Automatic tuning of scientific applications
Effective automatic parallelization and locality optimization using the polyhedral model

Effective automatic parallelization and locality optimization using the polyhedral model
Memory minimization for tensor contractions using integer linear programming

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Analytic models and empirical search: a hybrid approach to code optimization

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing

Parallel memory prediction for fused linear algebra kernels

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Parameterized micro-benchmarking: an auto-tuning approach for complex applications

Proceedings of the 9th conference on Computing Frontiers
SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Spiral in scala: towards the systematic construction of generators for performance libraries

Proceedings of the 12th international conference on Generative programming: concepts & experiences
Application-tailored linear algebra algorithms: A search-based approach

International Journal of High Performance Computing Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe a novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices. We also present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of up to 130% relative to the GotoBLAS on an AMD Opteron and up to 137% relative to MKL on an Intel Core 2.