A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Basic Linear Algebra Subprograms for Fortran Usage
ACM Transactions on Mathematical Software (TOMS)
Automatically tuned collective communications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Learning to Predict Performance from Formula Modeling and Training Data
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
The PHiPAC v1.0 Matrix-Multiply Distribution
The PHiPAC v1.0 Matrix-Multiply Distribution
Better tiling and array contraction for compiling scientific programs
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Data mining for simulation algorithm selection
Proceedings of the 2nd International Conference on Simulation Tools and Techniques
Prospectus for the next LAPACK and ScaLAPACK libraries
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Using experimental data to improve the performance modelling of parallel linear algebra routines
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Practical performance models of algorithms in evolutionary program induction and other domains
Artificial Intelligence
Automating the runtime performance evaluation of simulation algorithms
Winter Simulation Conference
Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning
International Journal of Parallel Programming
Hi-index | 0.00 |
Achieving peak performance from library subroutines usually requires extensive, machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate, at compile-time, by (1) generating a large number of possible implementations of a subroutine, and (2) selecting a fast implementation by an exhaustive, empirical search. This paper applies statistical techniques to exploit the large amount of performance data collected during the search. First, we develop a heuristic for stopping an exhaustive compiletime search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations. We apply our methods to actual performance data collected by the PHiPAC tuning system for matrix multiply on a variety of hardware platforms.