LAPACK: a portable linear algebra library for high-performance computers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Data dependence and program restructuring
The Journal of Supercomputing
A general framework for iteration-reordering loop transformations
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Maximizing parallelism and minimizing synchronization with affine transforms
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Data transformations for eliminating conflict misses
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Terascale spectral element algorithms and implementations
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
Proceedings of the 14th international conference on Supercomputing
Blocking and array contraction across arbitrarily nested loops using affine partitioning
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
Adaptive Optimizing Compilers for the 21st Century
The Journal of Supercomputing
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Iteration Space Slicing for Locality
LCPC '99 Proceedings of the 12th International Workshop on Languages and Compilers for Parallel Computing
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Finding effective compilation sequences
Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy
Proceedings of the international symposium on Code generation and optimization
Concurrency and Computation: Practice & Experience - Compilers for Parallel Computers
Tuning High Performance Kernels through Empirical Compilation
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Empirical optimization for a sparse linear solver: a case study
International Journal of Parallel Programming - Special issue: The next generation software program
Combining analytical and empirical approaches in tuning matrix transposition
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies
International Journal of Parallel Programming
Profitable loop fusion and tiling using model-driven empirical search
Proceedings of the 20th annual international conference on Supercomputing
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time
Proceedings of the International Symposium on Code Generation and Optimization
Model-guided empirical optimization for memory hierarchy
Model-guided empirical optimization for memory hierarchy
Iterative optimization in the polyhedral model: part ii, multidimensional time
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
A tuning framework for software-managed memory hierarchies
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
A scalable auto-tuning framework for compiler optimization
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Annotation-based empirical performance tuning using Orio
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
HiPC'08 Proceedings of the 15th international conference on High performance computing
A language for the compact representation of multiple program versions
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Speeding up Nek5000 with autotuning and specialization
Proceedings of the 24th ACM International Conference on Supercomputing
Exposing tunable parameters in multi-threaded numerical code
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Automatic code generation and tuning for stencil kernels on modern shared memory architectures
Computer Science - Research and Development
On-the-fly detection of precise loop nests across procedures on a dynamic binary translation system
Proceedings of the 8th ACM International Conference on Computing Frontiers
The challenges of writing portable, correct and high performance libraries for GPUs
ACM SIGARCH Computer Architecture News
Hirundo: a mechanism for automated production of optimized data stream graphs
ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Automated programmable control and parameterization of compiler optimizations
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Extendable pattern-oriented optimization directives
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
POET: a scripting language for applying parameterized source-to-source program transformations
Software—Practice & Experience
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Extendable pattern-oriented optimization directives
ACM Transactions on Architecture and Code Optimization (TACO)
Patus for convenient high-performance stencils: evaluation in earthquake simulations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Code generation for parallel execution of a class of irregular loops on distributed memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Automatic optimization of stream programs via source program operator graph transformations
Distributed and Parallel Databases
On Expressing Strategies for Directive-Driven Multicore Programing Models
Proceedings of Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms
Hi-index | 0.00 |
In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tuning framework that explores a set of different implementations of the same computation and automatically selects the best-performing implementation. Along with the original computation, a transformation recipe specifies a range of implementations of the computation resulting from composing a set of high-level code transformations. In our system, an underlying polyhedral framework coupled with transformation algorithms takes this set of transformations, composes them and automatically generates correct code. We first describe an abstract interface for transformation recipes, which we propose to facilitate interoperability with other transformation frameworks. We then focus on the specific transformation recipe interface used in our compiler and present performance results on its application to kernel and library tuning and tuning of key computations in high-end applications. We also show how this framework can be used to generate and auto-tune parallel OpenMP or CUDA code from a high-level specification.