Approximation algorithms for NP-hard problems
Approximation algorithms for NP-hard problems
Improving effective bandwidth through compiler enhancement of global cache reuse
Journal of Parallel and Distributed Computing
Cache-conscious coallocation of hot data streams
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Compilers: Principles, Techniques, and Tools (2nd Edition)
Compilers: Principles, Techniques, and Tools (2nd Edition)
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Lattice Boltzmann based PDE solver on the GPU
The Visual Computer: International Journal of Computer Graphics
Scalable Parallel Programming with CUDA
Queue - GPU Computing
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
CUDA-Lite: Reducing GPU Programming Complexity
Languages and Compilers for Parallel Computing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A control-structure splitting optimization for GPGPU
Proceedings of the 6th ACM conference on Computing frontiers
Single-particle 3d reconstruction from cryo-electron microscopy images on GPU
Proceedings of the 23rd international conference on Supercomputing
A cross-input adaptive framework for GPU program optimizations
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Increasing memory miss tolerance for SIMD cores
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 24th ACM International Conference on Supercomputing
Dynamic warp subdivision for integrated branch and memory divergence tolerance
Proceedings of the 37th annual international symposium on Computer architecture
Dymaxion: optimizing memory access patterns for heterogeneous systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Speculative parallelization on GPGPUs
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A study towards optimal data layout for GPU computing
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Dynamic compilation of data-parallel kernels for vector processors
Proceedings of the Tenth International Symposium on Code Generation and Optimization
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation
Proceedings of the 26th ACM international conference on Supercomputing
Shared memory multiplexing: a novel way to improve GPGPU throughput
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Spill code placement for SIMD machines
SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Reducing divergence in GPGPU programs with loop merging
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Atomic-free irregular computations on GPUs
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Optimizing select conditions on GPUs
Proceedings of the Ninth International Workshop on Data Management on New Hardware
Software-level scheduling to exploit non-uniformly shared data cache on GPGPU
Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
General transformations for GPU execution of tree traversals
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A large-scale cross-architecture evaluation of thread-coarsening
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
ACM Transactions on Programming Languages and Systems (TOPLAS)
Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Efficient Mapping of Irregular C++ Applications to Integrated GPUs
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Singe: leveraging warp specialization for high performance on GPUs
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations
International Journal of Parallel Programming
An efficient compiler framework for cache bypassing on GPUs
Proceedings of the International Conference on Computer-Aided Design
Leveraging GPUs using cooperative loop speculation
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to achieve those gains through software approaches on modern GPUs. This paper presents a systematic exploration to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of dynamic irregularities in both control flows and memory references, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, jeopardizing no basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.