Scalable Parallel Programming with CUDA
Queue - GPU Computing
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
ACM Transactions on Architecture and Code Optimization (TACO)
Increasing memory miss tolerance for SIMD cores
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Dynamic warp subdivision for integrated branch and memory divergence tolerance
Proceedings of the 37th annual international symposium on Computer architecture
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Dynamic detection of uniform and affine vectors in GPGPU computations
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators
Proceedings of the 38th annual international symposium on Computer architecture
Energy-efficient mechanisms for managing thread context in throughput processors
Proceedings of the 38th annual international symposium on Computer architecture
Thread block compaction for efficient SIMT control flow
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Divergence Analysis and Optimizations
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
SIMD re-convergence at thread frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Dynamic compilation of data-parallel kernels for vector processors
Proceedings of the Tenth International Symposium on Code Generation and Optimization
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
Proceedings of the 39th Annual International Symposium on Computer Architecture
Power-efficient computing for compute-intensive GPGPU applications
HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
Hi-index | 0.00 |
SIMT architectures improve performance and efficiency by exploiting control and memory-access structure across data-parallel threads. Value structure occurs when multiple threads operate on values that can be compactly encoded, e.g., by using a simple function of the thread index. We characterize the availability of control, memory-access, and value structure in typical kernels and observe ample amounts of value structure that is largely ignored by current SIMT architectures. We propose three microarchitectural mechanisms to exploit value structure based on compact affine execution of arithmetic, branch, and memory instructions. We explore these mechanisms within the context of traditional SIMT microarchitectures (GP-SIMT), found in general-purpose graphics processing units, as well as fine-grain SIMT microarchitectures (FG-SIMT), a SIMT variant appropriate for compute-focused data-parallel accelerators. Cycle-level modeling of a modern GP-SIMT system and a VLSI implementation of an eight-lane FG-SIMT execution engine are used to evaluate a range of application kernels. When compared to a baseline without compact affine execution, our approach can improve GP-SIMT cycle-level performance by 4-17% and can improve FG-SIMT absolute performance by 20-65% and energy efficiency up to 30% for a majority of the kernels.