Superoptimizer: a look at the smallest program
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Automated selection of mathematical software
ACM Transactions on Mathematical Software (TOMS)
Using profile information to assist classic code optimizations
Software—Practice & Experience
Eliminating branches using a superoptimizer and the GNU C compiler
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Compiler blockability of numerical algorithms
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
High-level optimization via automated statistical modeling
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The nature of statistical learning theory
The nature of statistical learning theory
Advanced compiler optimizations for sparse computations
Journal of Parallel and Distributed Computing
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Dynamic feedback: an effective technique for adaptive computing
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
A relational approach to the automatic generation of sequential sparse matrix codes
A relational approach to the automatic generation of sequential sparse matrix codes
Locality of Reference in LU Decomposition with Partial Pivoting
SIAM Journal on Matrix Analysis and Applications
Graphical models for machine learning and digital communication
Graphical models for machine learning and digital communication
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
ACM Transactions on Mathematical Software (TOMS)
Fast training of support vector machines using sequential minimal optimization
Advances in kernel methods
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Architecture-cognizant divide and conquer algorithms
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
An adaptive software library for fast Fourier transforms
Proceedings of the 14th international conference on Supercomputing
Transforming loops to recursion for multi-level memory hierarchies
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Overcoming the challenges to feedback-directed optimization (Keynote Talk)
DYNAMO '00 Proceedings of the ACM SIGPLAN workshop on Dynamic and adaptive compilation and optimization
ACM Transactions on Mathematical Software (TOMS) - Special issue in honor of John Rice's 65th birthday
Note on generalization in experimental algorithmics
ACM Transactions on Mathematical Software (TOMS)
Adaptive optimization in the Jalapeño JVM (poster session)
OOPSLA '00 Addendum to the 2000 proceedings of the conference on Object-oriented programming, systems, languages, and applications (Addendum)
Implementation of Strassen's algorithm for matrix multiplication
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Automatically tuned collective communications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Language support for Morton-order matrices
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
The hardness of cache conscious data placement
POPL '02 Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Adaptive Optimizing Compilers for the 21st Century
The Journal of Supercomputing
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Fast Automatic Generation of DSP Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Algorithm Selection using Reinforcement Learning
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Runtime Interprocedural Data Placement Optimisation for Lazy Parallel Libraries (Extended Abstract)
Euro-Par '97 Proceedings of the Third International Euro-Par Conference on Parallel Processing
Delayed Evaluation, Self-optimising Software Components as a Programming Model
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
MPI-2: Extending the Message-Passing Interface
Euro-Par '96 Proceedings of the Second International Euro-Par Conference on Parallel Processing - Volume I
Extending the Hong-Kung Model to Memory Hierarchies
COCOON '95 Proceedings of the First Annual International Conference on Computing and Combinatorics
ECOOP '98 Workshop ion on Object-Oriented Technology
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Active harmony: towards automated performance tuning
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Better tiling and array contraction for compiling scientific programs
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A high-level approach to synthesis of high-performance codes for quantum chemistry
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance optimizations and bounds for sparse matrix-vector multiply
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Compiler optimization-space exploration
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
An infrastructure for adaptive dynamic optimization
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Continuous program optimization: A case study
ACM Transactions on Programming Languages and Systems (TOPLAS)
A comparison of empirical and model-driven optimization
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Meta optimization: improving compiler heuristics with machine learning
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Gprof: A call graph execution profiler
SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
Automatic Analytical Modeling for the Estimation of Cache Misses
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
ICPPW '02 Proceedings of the 2002 International Conference on Parallel Processing Workshops
ADAPT: Automated De-Coupled Adaptive Program Transformation
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
The PHiPAC v1.0 Matrix-Multiply Distribution
The PHiPAC v1.0 Matrix-Multiply Distribution
I/O-Efficient Algorithms for Problems on Grid-Based Terrains
Journal of Experimental Algorithmics (JEA)
Automatic benchmark generation for cache optimization of matrix operations
ACM-SE 33 Proceedings of the 33rd annual on Southeast regional conference
Self-adapting numerical software and automatic tuning of heuristics
ICCS'03 Proceedings of the 2003 international conference on Computational science
Self-adapting numerical software and automatic tuning of heuristics
ICCS'03 Proceedings of the 2003 international conference on Computational science
A framework for adaptive algorithm selection in STAPL
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Reduction Transformations for Optimization Parameter Selection
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
SmartApps: middle-ware for adaptive applications on reconfigurable platforms
ACM SIGOPS Operating Systems Review
A comparison of online and offline strategies for program adaptation
ACM-SE 45 Proceedings of the 45th annual southeast regional conference
Rapidly Selecting Good Compiler Optimizations using Performance Counters
Proceedings of the International Symposium on Code Generation and Optimization
MPI collective algorithm selection and quadtree encoding
Parallel Computing
Exploring and predicting the architecture/optimising compiler co-design space
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Automated transformation for performance-critical kernels
LCSD '07 Proceedings of the 2007 Symposium on Library-Centric Software Design
Data mining for simulation algorithm selection
Proceedings of the 2nd International Conference on Simulation Tools and Techniques
Adaptive Application Composition in Quantum Chemistry
QoSA '09 Proceedings of the 5th International Conference on the Quality of Software Architectures: Architectures for Adaptive Software Systems
An Efficient and Adaptive Mechanism for Parallel Simulation Replication
PADS '09 Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Automating the runtime performance evaluation of simulation algorithms
Winter Simulation Conference
Probabilistic auto-tuning for architectures with complex constraints
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Efficiently exploring compiler optimization sequences with pairwise pruning
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Selecting Simulation Algorithm Portfolios by Genetic Algorithms
PADS '10 Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation
ACM Transactions on Embedded Computing Systems (TECS)
Predictive modeling in a polyhedral optimization space
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations
Software—Practice & Experience
Decision trees and MPI collective algorithm selection problem
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Towards making autotuning mainstream
International Journal of High Performance Computing Applications
Hi-index | 0.00 |
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e. actually running the code). This paper presents quantitative data that motivate the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Secondly, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations when the space of inputs can be described by continuously varying features. We address both problems by using statistical modeling techniques that exploit the large amount of performance data collected during the search. We demonstrate these methods on actual performance data collected by the PHiPAC tuning system for dense matrix multiply. We close with a survey of recent projects that use or otherwise advocate an empirical search-based approach to code generation and algorithm selection, whether at the level of computational kernels, compiler and run-time systems, or problem-solving environments. Collectively, these efforts suggest a number of possible software architectures for constructing platform-adapted libraries and applications.