An overview of the PTRAN analysis system for multiprocessing
Proceedings of the 1st International Conference on Supercomputing
Supercompilers for parallel and vector computers
Supercompilers for parallel and vector computers
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
Data relocation and prefetching for programs with large data sets
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Precise miss analysis for program transformations with caches of arbitrary associativity
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Compiler optimization-space exploration
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Predicting the impact of optimizations for embedded systems
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Improving Cache Behavior of Dynamically Allocated Data Structures
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
International Journal of High Performance Computing Applications
Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Automatic Tuning Matrix Multiplication Performance on Graphics Hardware
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Using Machine Learning to Focus Iterative Optimization
Proceedings of the International Symposium on Code Generation and Optimization
A systematic approach to delivering instruction-level parallelism in epic systems
A systematic approach to delivering instruction-level parallelism in epic systems
Accelerator: using data parallelism to program GPUs for general-purpose uses
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A memory model for scientific algorithms on graphics processors
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Evaluating Heuristic Optimization Phase Order Search Algorithms
Proceedings of the International Symposium on Code Generation and Optimization
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Iterative induced dipoles computation for molecular mechanics on GPUs
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
GPU-based island model for evolutionary algorithms
Proceedings of the 12th annual conference on Genetic and evolutionary computation
Towards metaprogramming for parallel systems on a chip
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Reducing branch divergence in GPU programs
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
International Journal of High Performance Computing Applications
ACO with tabu search on a GPU for solving QAPs using move-cost adjusted thread assignment
Proceedings of the 13th annual conference on Genetic and evolutionary computation
EvoCOP'11 Proceedings of the 11th European conference on Evolutionary computation in combinatorial optimization
Journal of Computational and Applied Mathematics
Advances in Engineering Software
A new parallel method of smith-waterman algorithm on a heterogeneous platform
ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Implementing p systems parallelism by means of GPUs
WMC'09 Proceedings of the 10th international conference on Membrane Computing
Optimizing stencil application on multi-thread GPU architecture using stream programming model
ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
EvoCOP'10 Proceedings of the 10th European conference on Evolutionary Computation in Combinatorial Optimization
GPU-Based multi-start local search algorithms
LION'05 Proceedings of the 5th international conference on Learning and Intelligent Optimization
The tradeoffs of fused memory hierarchies in heterogeneous computing architectures
Proceedings of the 9th conference on Computing Frontiers
Generating GPU code from a high-level representation for image processing kernels
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Reducing thread divergence in GPU-based b&b applied to the flow-shop problem
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
A simulated annealing algorithm for GPU clusters
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
ACO on multiple GPUs with CUDA for faster solution of QAPs
PPSN'12 Proceedings of the 12th international conference on Parallel Problem Solving from Nature - Volume Part II
Parallelization strategies for hybrid metaheuristics using a single GPU and multi-core resources
PPSN'12 Proceedings of the 12th international conference on Parallel Problem Solving from Nature - Volume Part II
Mastering software variant explosion for GPU accelerators
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
User transparent data and task parallel multimedia computing with Pyxis-DT
Future Generation Computer Systems
Optimising space exploration of OpenCL for GPGPUs
International Journal of Computational Science and Engineering
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation
Proceedings of Workshop on General Purpose Processing Using GPUs
Hi-index | 0.00 |
Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique that begins with a complete optimization space and prunes it down to a set of configurations that is likely to contain the global maximum. The remaining configurations can then be evaluated to determine the one with the best performance. The technique can reduce the number of configurations to be evaluated by as much as 98% and is successful at finding a near-best configuration. For some applications, we show that this approach is significantly superior to random sampling of the search space.