Auto-tuning has emerged as an important practical method for creating highly optimized implementations of key computational kernels and applications. However, the growing complexity of architectures and applications is creating new challenges for auto-tuning. Complex applications can involve a search space so large that empirical auto-tuning becomes impractical. Similarly, architectures are becoming more complicated, which makes performance hard to model. In this paper, we focus on the challenge that applications with a large number of kernels and kernel instantiations pose for auto-tuning. While these kernels may share a broadly similar pattern, they differ considerably in problem size and in the exact computation performed. We propose and evaluate a new approach to auto-tuning, which we refer to as parameterized micro-benchmarking. It is an alternative to the two existing classes of auto-tuning approaches: analytical model-based and empirical search-based. In particular, we argue that the former may not capture all the architectural features that affect performance, whereas the latter may be too expensive for an application that has several different kernels. In our approach, the different expressions in the application, the possible implementations of each expression, and the key architectural features are used to derive a simple micro-benchmark and a small parameter space. We have evaluated our approach in the context of GPU implementations of tensor contraction expressions.
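The abstract does not spell out the micro-benchmark itself, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a CPU stand-in written in plain Python rather than a GPU kernel, a single matrix-style contraction `C[i][j] = sum_k A[i][k] * B[k][j]`, and a hypothetical two-parameter space (tile size and loop order). The names `contract`, `micro_benchmark`, and `pick_config` are invented for this example.

```python
import itertools
import time


def contract(A, B, tile, order):
    """Tiled contraction C[i][j] = sum_k A[i][k] * B[k][j] on nested lists.

    `tile` and `order` are the tunable parameters (hypothetical choices).
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                i_rng = range(i0, min(i0 + tile, n))
                j_rng = range(j0, min(j0 + tile, p))
                k_rng = range(k0, min(k0 + tile, m))
                if order == "ijk":
                    for i in i_rng:
                        for j in j_rng:
                            acc = C[i][j]
                            for k in k_rng:
                                acc += A[i][k] * B[k][j]
                            C[i][j] = acc
                else:  # "ikj": reuse A[i][k] across the inner j loop
                    for i in i_rng:
                        for k in k_rng:
                            a = A[i][k]
                            for j in j_rng:
                                C[i][j] += a * B[k][j]
    return C


def micro_benchmark(size, tiles=(4, 8, 16), orders=("ijk", "ikj"), repeats=3):
    """Time every (tile, order) pair on one small, fixed-size problem."""
    A = [[float(i + k) for k in range(size)] for i in range(size)]
    B = [[float(k - j) for j in range(size)] for k in range(size)]
    timings = {}
    for tile, order in itertools.product(tiles, orders):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            contract(A, B, tile, order)
            best = min(best, time.perf_counter() - start)
        timings[(tile, order)] = best  # keep the fastest of the repeats
    return timings


def pick_config(timings):
    """Return the (tile, order) pair with the smallest measured time."""
    return min(timings, key=timings.get)
```

A call such as `pick_config(micro_benchmark(32))` returns one `(tile, order)` pair. The point of the sketch is only that a small benchmark over a small parameter space is measured once, and its result is then reused for many kernel instantiations instead of searching each one empirically.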