Prefetching using Markov predictors
Proceedings of the 24th annual international symposium on Computer architecture
Run-time spatial locality detection and optimization
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Proceedings of the 27th annual international symposium on Computer architecture
Data Cache Prefetching Using a Global History Buffer
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
A Case for MLP-Aware Cache Replacement
Proceedings of the 33rd annual international symposium on Computer Architecture
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Distributed order scheduling and its application to multi-core dram controllers
Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing
Mars: a MapReduce framework on graphics processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
POWER4 system microarchitecture
IBM Journal of Research and Development
Complexity effective memory access scheduling for many-core accelerator architectures
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Coordinated control of multiple prefetchers in multi-core systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Improving memory bank-level parallelism in the presence of prefetching
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Programming Massively Parallel Processors: A Hands-on Approach
Programming Massively Parallel Processors: A Hands-on Approach
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Energy-efficient mechanisms for managing thread context in throughput processors
Proceedings of the 38th annual international symposium on Computer architecture
CudaDMA: optimizing GPU memory bandwidth via warp specialization
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Regulating Locality vs. Parallelism Tradeoffs in Multiple Memory Controller Environments
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
When Prefetching Works, When It Doesn’t, and Why
ACM Transactions on Architecture and Code Optimization (TACO)
Hardware transactional memory for GPU architectures
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Automatic C-to-CUDA code generation for affine programs
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Balancing DRAM locality and parallelism in shared memory CMP systems
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Characterizing and improving the use of demand-fetched caches in GPUs
Proceedings of the 26th ACM international conference on Supercomputing
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
IEEE Computer Architecture Letters
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Cache-Conscious Wavefront Scheduling
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Neither more nor less: optimizing thread-level parallelism for GPGPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A locality-aware memory hierarchy for energy-efficient GPU architectures
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Divergence-aware warp scheduling
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Warped gates: gating aware scheduling and power gating for GPGPUs
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications
Proceedings of Workshop on General Purpose Processing Using GPUs
Hi-index | 0.00 |
In this paper, we present techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies. We demonstrate that existing warp scheduling policies in GPGPU architectures are unable to effectively incorporate data prefetching. The main reason is that they schedule consecutive warps, which are likely to access nearby cache blocks and thus prefetch accurately for one another, back-to-back in consecutive cycles. This either 1) causes prefetches to be generated by a warp too close to the time their corresponding addresses are actually demanded by another warp, or 2) requires sophisticated prefetcher designs to correctly predict the addresses required by a future "far-ahead" warp while executing the current warp. We propose a new prefetch-aware warp scheduling policy that overcomes these problems. The key idea is to separate in time the scheduling of consecutive warps such that they are not executed back-to-back. We show that this policy not only enables a simple prefetcher to be effective in tolerating memory latencies but also improves memory bank parallelism, even when prefetching is not employed. Experimental evaluations across a diverse set of applications on a 30-core simulated GPGPU platform demonstrate that the prefetch-aware warp scheduler provides 25% and 7% average performance improvement over baselines that employ prefetching in conjunction with, respectively, the commonly-employed round-robin scheduler or the recently-proposed two-level warp scheduler. Moreover, when prefetching is not employed, the prefetch-aware warp scheduler provides higher performance than both of these baseline schedulers as it better exploits memory bank parallelism.