Advanced compiler design and implementation
Advanced compiler design and implementation
Vector instruction set support for conditional operations
Proceedings of the 27th annual international symposium on Computer architecture
Communications of the ACM - Special issue on computer architecture
Efficient conditional operations for data-parallel architectures
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Conversion of control dependence to data dependence
POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
RPU: a programmable ray processing unit for realtime ray tracing
ACM SIGGRAPH 2005 Papers
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Increasing memory miss tolerance for SIMD cores
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Accelerating large graph algorithms on the GPU using CUDA
HiPC'07 Proceedings of the 14th international conference on High performance computing
Dynamic warp subdivision for integrated branch and memory divergence tolerance
Proceedings of the 37th annual international symposium on Computer architecture
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Thread block compaction for efficient SIMT control flow
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Microarchitectural mechanisms to exploit value structure in SIMT architectures
Proceedings of the 40th Annual International Symposium on Computer Architecture
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction
Proceedings of the 40th Annual International Symposium on Computer Architecture
Neither more nor less: optimizing thread-level parallelism for GPGPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A locality-aware memory hierarchy for energy-efficient GPU architectures
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Evaluator-executor transformation for efficient pipelining of loops with conditionals
ACM Transactions on Architecture and Code Optimization (TACO)
HARP: Harnessing inactive threads in many-core processors
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.