Statistical Models for Empirical Search-Based Performance Tuning

Authors:
Richard Vuduc; James W. Demmel; Jeff A. Bilmes
Affiliations:
Computer Science Division, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720, USA; Computer Science Division, Department of Electrical Engineering and Computer Sciences, and Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720, USA; Department of Electrical Engineering, University of Washington, Seattle, WA, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2004

Abstract

Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivate the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems that arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations when the space of inputs can be described by continuously varying features. We address both problems by using statistical modeling techniques that exploit the large amount of performance data collected during the search. We demonstrate these methods on actual performance data collected by the PHiPAC tuning system for dense matrix multiply. We close with a survey of recent projects that use or otherwise advocate an empirical search-based approach to code generation and algorithm selection, whether at the level of computational kernels, compiler and run-time systems, or problem-solving environments. Collectively, these efforts suggest a number of possible software architectures for constructing platform-adapted libraries and applications.
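
To make the two problems concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method developed in the paper. For early stopping it uses a simple distribution-free argument: if implementations are sampled uniformly at random, the chance that none of the first t samples falls in the top fraction q of the space is at most (1 - q)^t, so the search can stop once that bound drops below a tolerance alpha. For run-time selection it uses a nearest-neighbor rule on log-scaled matrix dimensions, which stands in for whatever statistical classifier one would actually fit. All names (`samples_needed`, `search_with_early_stop`, `make_selector`, `measure`) are hypothetical.

```python
import math
import random


def samples_needed(q, alpha):
    """Smallest t such that (1 - q)**t <= alpha.

    q     -- fraction of the space regarded as "near-optimal" (assumed)
    alpha -- tolerated probability of missing that fraction (assumed)
    """
    return math.ceil(math.log(alpha) / math.log(1.0 - q))


def search_with_early_stop(candidates, measure, q=0.05, alpha=0.1, seed=0):
    """Time candidates in random order and stop after a statistical budget.

    `measure` is a hypothetical callback returning the performance of one
    implementation (higher is better), e.g. measured Mflop/s.
    Sampling without replacement only makes the (1 - q)**t bound safer.
    """
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)
    budget = min(len(order), samples_needed(q, alpha))
    best, best_perf = None, float("-inf")
    for impl in order[:budget]:
        perf = measure(impl)
        if perf > best_perf:
            best, best_perf = impl, perf
    return best, best_perf, budget


def make_selector(training):
    """Build a run-time decision rule from (features, best_impl) pairs.

    Features are continuously varying inputs such as matrix dimensions
    (m, k, n). The rule returns the implementation that was best at the
    nearest training point, measured in log-space.
    """
    def select(shape):
        def dist2(item):
            feat, _ = item
            return sum((math.log(a) - math.log(b)) ** 2
                       for a, b in zip(feat, shape))
        return min(training, key=dist2)[1]
    return select
```

With q = 0.05 and alpha = 0.1 the budget is 45 timed implementations, regardless of how large the space is. A rule that exploits the observed performance distribution, as the abstract describes, could stop earlier or later depending on the data.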