GPGPUs are optimized for graphics; consequently, the hardware targets massively data-parallel applications characterized by predictable memory access patterns and little control flow. For such applications, e.g., matrix multiplication, GPGPU-based systems can achieve very high performance. However, many general-purpose data-parallel applications exhibit intensive control flow and unpredictable memory access patterns. Optimizing such code for current hardware is often ineffective and even impractical, since it results in low hardware utilization and therefore relatively low performance. This work tracks the root causes of execution inefficiencies when running control-flow-intensive CUDA applications on NVIDIA GPGPU hardware. We show, both analytically and by simulation of various benchmarks, that local thread scheduling has inherent limitations when dealing with applications that exhibit a high rate of branch divergence. To overcome these limitations, we propose hierarchical warp scheduling and global warp reconstruction. We implement an ideal hierarchical warp scheduling mechanism, termed ODGS (Oracle Dynamic Global Scheduling), designed to maximize machine utilization via global warp reconstruction. We show that in control-flow-bound applications that make no use of shared memory, (1) there is still substantial potential for performance improvement, and (2) we demonstrate this feasible improvement on various synthetic and real benchmarks. For example, MUM and BFS are parallel graph algorithms that suffer from significant branch divergence; for these algorithms it is possible to achieve performance gains of up to 4.4x and 2.6x, respectively, relative to previously applied scheduling methods.
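To illustrate the utilization problem the abstract describes, the following is a minimal toy model (not the paper's simulator, and with assumed parameters: warp size 32, a single two-way branch, and ideal zero-overhead repacking). It contrasts baseline per-warp SIMT execution, where a diverged warp serializes both branch paths with idle lanes, against an idealized global warp reconstruction that regroups same-direction threads from all warps into full warps, as ODGS aims to do.

```python
import math

# Assumed warp width, matching NVIDIA hardware; the model itself is a sketch.
WARP_SIZE = 32

def local_utilization(masks):
    """Baseline SIMT: each warp handles its own divergence.

    `masks` is a list of per-warp boolean lists (branch outcome per thread).
    A warp containing both taken and not-taken threads issues two passes,
    each with some lanes masked off, so lane utilization drops.
    """
    cycles = 0
    active = 0
    for m in masks:
        taken = sum(m)
        not_taken = len(m) - taken
        # 1 pass if the warp is uniform, 2 passes if it diverged.
        cycles += (taken > 0) + (not_taken > 0)
        active += len(m)  # every thread executes exactly once
    return active / (cycles * WARP_SIZE)

def global_utilization(masks):
    """Idealized global warp reconstruction: threads from all warps that
    chose the same branch direction are repacked into full warps."""
    taken = sum(sum(m) for m in masks)
    not_taken = sum(len(m) - sum(m) for m in masks)
    cycles = math.ceil(taken / WARP_SIZE) + math.ceil(not_taken / WARP_SIZE)
    return (taken + not_taken) / (cycles * WARP_SIZE)

# Example: 4 warps, each with only 8 of 32 threads taking the branch.
masks = [[True] * 8 + [False] * 24 for _ in range(4)]
print(local_utilization(masks))   # 0.5: every warp pays two passes
print(global_utilization(masks))  # 1.0: repacked into 1 taken + 3 not-taken warps
```

Under this model, local scheduling halves utilization on the example workload, while global repacking restores full lanes; the real gains reported for MUM and BFS are of course bounded by scheduling overheads and memory behavior that this sketch ignores.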