The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A bandwidth-efficient architecture for media processing
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Communications of the ACM - Special issue on computer architecture
Efficient conditional operations for data-parallel architectures
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Tarantula: a vector extension to the alpha architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic speculative precomputation
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Handling long-latency loads in a simultaneous multithreading processor
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Mechanism for SIMD Execution of SPMD Programs
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
A Media-Enhanced Vector Architecture for Embedded Memory Systems
A Media-Enhanced Vector Architecture for Embedded Memory Systems
The Vector-Thread Architecture
Proceedings of the 31st annual international symposium on Computer architecture
Chip multiprocessing and the cell broadband engine
Proceedings of the 3rd conference on Computing frontiers
A Case for MLP-Aware Cache Replacement
Proceedings of the 33rd annual international symposium on Computer Architecture
Learning-Based SMT Processor Resource Distribution via Hill-Climbing
Proceedings of the 33rd annual international symposium on Computer Architecture
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
The M5 Simulator: Modeling Networked Systems
IEEE Micro
Performance Improvement Methodology for ClearSpeed's CSX600
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Implementation of Permutation Functions in Illiac IV-Type Computers
IEEE Transactions on Computers
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
A performance study of general-purpose applications on graphics processors using CUDA
Journal of Parallel and Distributed Computing
Increasing memory miss tolerance for SIMD cores
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Accelerating CUDA graph algorithms at maximum warp
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
On-the-fly elimination of dynamic irregularities for GPU computing
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Reducing branch divergence in GPU programs
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
On the correctness of the SIMT execution model of GPUs
ESOP'12 Proceedings of the 21st European conference on Programming Languages and Systems
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation
Proceedings of the 26th ACM international conference on Supercomputing
Simultaneous branch and warp interweaving for sustained GPU performance
Proceedings of the 39th Annual International Symposium on Computer Architecture
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
Proceedings of the 39th Annual International Symposium on Computer Architecture
iGPU: exception support and speculative execution on GPUs
Proceedings of the 39th Annual International Symposium on Computer Architecture
Branch and data herding: reducing control and memory divergence for error-tolerant GPU applications
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Cache-Conscious Wavefront Scheduling
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Efficient scheduling of recursive control flow on GPUs
Proceedings of the 27th international ACM conference on International conference on supercomputing
Optimizing select conditions on GPUs
Proceedings of the Ninth International Workshop on Data Management on New Hardware
Future of GPGPU micro-architectural parameters
Proceedings of the Conference on Design, Automation and Test in Europe
Using synchronization stalls in power-aware accelerators
Proceedings of the Conference on Design, Automation and Test in Europe
Microarchitectural mechanisms to exploit value structure in SIMT architectures
Proceedings of the 40th Annual International Symposium on Computer Architecture
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction
Proceedings of the 40th Annual International Symposium on Computer Architecture
Journal of Parallel and Distributed Computing
ACM Transactions on Programming Languages and Systems (TOPLAS)
A locality-aware memory hierarchy for energy-efficient GPU architectures
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Divergence-aware warp scheduling
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Rhythm: harnessing data parallel hardware for server workloads
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations
International Journal of Parallel Programming
HARP: Harnessing inactive threads in many-core processors
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) are stalled due to long latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth x SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should first be exploited. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles and reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit to run ahead. The result is improved latency hiding and memory level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show our technique improves performance on average by 1.7X.