Semicoarsening Multigrid on Distributed Memory Machines
SIAM Journal on Scientific Computing
SPL: a language and compiler for DSP algorithms
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Active harmony: towards automated performance tuning
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Parallel Parameter Tuning for Applications with Performance Variability
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
The Tau Parallel Performance System
International Journal of High Performance Computing Applications
The Journal of Supercomputing
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies
International Journal of Parallel Programming
Ablego: a function outlining and partial inlining framework: Research Articles
Software—Practice & Experience
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time
Proceedings of the International Symposium on Code Generation and Optimization
Model-guided empirical optimization for memory hierarchy
Model-guided empirical optimization for memory hierarchy
Iterative optimization in the polyhedral model: part ii, multidimensional time
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
A scalable auto-tuning framework for compiler optimization
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Annotation-based empirical performance tuning using Orio
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org
Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
A code isolator: isolating code fragments from large programs
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Effective source-to-source outlining to support whole program empirical optimization
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A ROSE-Based OpenMP 3.0 research compiler supporting multiple runtime libraries
IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Portable section-level tuning of compiler parallelized applications
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
OpenMP and MPI application energy measurement variation
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Towards making autotuning mainstream
International Journal of High Performance Computing Applications
Tools for machine-learning-based empirical autotuning and specialization
International Journal of High Performance Computing Applications
Hi-index | 0.00 |
In this paper, we take a concrete step towards materializing our long-term goal of providing a fully automatic end-to-end tuning infrastructure for arbitrary program components and full applications. We describe a general-purpose offline auto-tuning framework and apply it to an application benchmark, SMG2000, a semi-coarsening multigrid on structured grids. We show that the proposed system first extracts computationally intensive loop nests into separate executable functions, a code transformation called outlining. The outlined loop nests are then tuned by the framework and subsequently integrated back into the application. Each loop nest is optimized through a series of composable code transformations, with the transformations parameterized by unbound optimization parameters that are bound during the tuning process. The values for these parameters are selected using a search-based auto-tuner, which performs a parallel heuristic search for the best-performing optimized variants of the outlined loop nests. We show that our system pinpoints a code variant that performs 2.37 times faster than the original loop nest. When the full application is run using the code variant found by the system, the applicationâ聙聶s performance improves by 27%.