Vector instruction set support for conditional operations
Proceedings of the 27th annual international symposium on Computer architecture
Chap - a SIMD graphics processor
SIGGRAPH '84 Proceedings of the 11th annual conference on Computer graphics and interactive techniques
Multithreaded Vector Architectures
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Vector microprocessors
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Dynamic warp subdivision for integrated branch and memory divergence tolerance
Proceedings of the 37th annual international symposium on Computer architecture
A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
IISWC '10 Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'10)
Simplified vector-thread architectures for flexible and efficient data-parallel accelerators
Simplified vector-thread architectures for flexible and efficient data-parallel accelerators
Reducing branch divergence in GPU programs
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators
Proceedings of the 38th annual international symposium on Computer architecture
Thread block compaction for efficient SIMT control flow
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Active thread compaction for GPU path tracing
Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
GPU Computing Gems Jade Edition
GPU Computing Gems Jade Edition
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Simultaneous branch and warp interweaving for sustained GPU performance
Proceedings of the 39th Annual International Symposium on Computer Architecture
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
Proceedings of the 39th Annual International Symposium on Computer Architecture
A locality-aware memory hierarchy for energy-efficient GPU architectures
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.