Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multithreaded Vector Architectures
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Control Flow Optimization Via Dynamic Reconvergence Prediction
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Conjoined-Core Chip Multiprocessing
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
The Vector-Thread Architecture
IEEE Micro
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Transparent control independence (TCI)
Proceedings of the 34th annual international symposium on Computer architecture
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
ACM Transactions on Architecture and Code Optimization (TACO)
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
IEEE Micro
Dynamic warp subdivision for integrated branch and memory divergence tolerance
Proceedings of the 37th annual international symposium on Computer architecture
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
Dynamic detection of uniform and affine vectors in GPGPU computations
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Barra: A Parallel Functional Simulator for GPGPU
MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Energy-efficient mechanisms for managing thread context in throughput processors
Proceedings of the 38th annual international symposium on Computer architecture
Thread block compaction for efficient SIMT control flow
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
PEPSC: A Power-Efficient Processor for Scientific Computing
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Towards Solving the Table Maker's Dilemma on GPU
PDP '12 Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Warp size impact in GPUs: large or small?
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Efficient scheduling of recursive control flow on GPUs
Proceedings of the 27th international ACM conference on International conference on supercomputing
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction
Proceedings of the 40th Annual International Symposium on Computer Architecture
Evaluator-executor transformation for efficient pipelining of loops with conditionals
ACM Transactions on Architecture and Code Optimization (TACO)
HARP: Harnessing inactive threads in many-core processors
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Leveraging GPUs using cooperative loop speculation
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the the same row of execution units, instead of one single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.