Due to their massive computational power, graphics processing units (GPUs) have become a popular platform for executing general-purpose parallel applications. GPU programming models allow the programmer to create thousands of threads, each executing the same computing kernel. GPUs exploit this parallelism in two ways: threads are grouped into fixed-size SIMD batches known as warps, and many such warps execute concurrently on a single GPU core. Despite these techniques, the computational resources on GPU cores remain underutilized, so performance falls far short of what the hardware could deliver. Two reasons for this are conditional branch instructions and stalls due to long-latency operations. To use these resources more effectively, we propose two independent ideas: the large warp microarchitecture and two-level warp scheduling. We show that, when combined, our mechanisms improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that have so far been unable to fully exploit the available resources of the GPU chip.
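To make the underutilization caused by conditional branches concrete, the following is a minimal sketch of a baseline core that serializes the two sides of a branch within each warp. It is not the large warp microarchitecture or two-level scheduling proposed here; it only models the problem those mechanisms target. The warp width of 32 and the names `WARP_SIZE` and `simd_utilization` are assumptions introduced for illustration.

```python
WARP_SIZE = 32  # assumed warp width; the text only says warps are fixed-size


def simd_utilization(branch_taken, warp_size=WARP_SIZE):
    """Fraction of issued SIMD lane slots that do useful work.

    branch_taken holds one boolean per thread. Each warp executes the
    taken path and then the not-taken path, and every pass occupies all
    warp_size lanes even if only some threads are active on that path.
    """
    useful = 0  # lane slots occupied by active threads
    issued = 0  # lane slots consumed, active or idle
    for start in range(0, len(branch_taken), warp_size):
        warp = branch_taken[start:start + warp_size]
        taken = sum(warp)
        not_taken = len(warp) - taken
        for active in (taken, not_taken):
            if active:  # a path with no active threads is skipped
                useful += active
                issued += warp_size
    return useful / issued if issued else 0.0


uniform = [True] * 64                        # no divergence in either warp
divergent = [i % 2 == 0 for i in range(64)]  # half of each warp takes the branch

print(simd_utilization(uniform))    # 1.0
print(simd_utilization(divergent))  # 0.5
```

In this simplified model, a warp whose threads split evenly across a branch leaves half of its lanes idle on each pass, which is the kind of wasted capacity the abstract attributes to conditional branch instructions.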