Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Authors:
Jaekyu Lee;Nagesh B. Lakshminarayana;Hyesoon Kim;Richard Vuduc
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Abstract

We consider the problem of improving memory latency tolerance in massively multithreaded GPGPUs when an application's thread-level parallelism is not sufficient to hide memory latency. One solution used in conventional CPU systems is prefetching, in both hardware and software. However, we show that straightforwardly applying such mechanisms to GPGPU systems does not deliver the expected performance benefits and can in fact hurt performance when not used judiciously. This paper proposes new hardware and software prefetching mechanisms tailored to GPGPU systems, which we refer to as many-thread aware prefetching (MT-prefetching) mechanisms. Our software MT-prefetching mechanism, called inter-thread prefetching, exploits the common memory access behavior among fine-grained threads. For hardware MT-prefetching, we describe a scalable prefetcher training algorithm along with a hardware-based inter-thread prefetching mechanism. Because blindly applied prefetching can degrade performance, we also propose an adaptive prefetch throttling scheme, which permits automatic adjustment specific to the GPGPU application and hardware. We show that adaptation reduces the negative effects of prefetching and can even improve performance. Overall, compared to state-of-the-art software and hardware prefetching, our MT-prefetching improves performance on our benchmarks by an average of 16% (software prefetching) and 15% (hardware prefetching).
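The idea behind inter-thread prefetching can be illustrated with a toy model. The sketch below is not the authors' implementation: the abstract does not give the mechanism's details, so the code assumes a simple strided pattern in which thread `tid` loads one address, warps run in order, and every thread prefetches the address that the matching thread in the next warp will load. The names `WARP_SIZE`, `demand_address`, and `simulate` are hypothetical, and the model ignores timing, cache capacity, and warp scheduling.

```python
WARP_SIZE = 32  # threads per warp (assumed value for illustration)


def demand_address(tid, base=0, stride=4):
    """Address that thread `tid` loads, assuming a per-thread strided pattern."""
    return base + tid * stride


def simulate(num_threads, inter_thread_prefetch):
    """Run warps in order; return the fraction of demand loads already prefetched."""
    prefetched = set()
    hits = 0
    for warp_start in range(0, num_threads, WARP_SIZE):
        warp = range(warp_start, min(warp_start + WARP_SIZE, num_threads))
        # Demand loads: count those covered by an earlier warp's prefetches.
        for tid in warp:
            if demand_address(tid) in prefetched:
                hits += 1
        # Inter-thread prefetch: each thread fetches on behalf of the
        # thread with the same lane position in the next warp.
        if inter_thread_prefetch:
            for tid in warp:
                target = tid + WARP_SIZE
                if target < num_threads:
                    prefetched.add(demand_address(target))
    return hits / num_threads


baseline = simulate(128, inter_thread_prefetch=False)
coverage = simulate(128, inter_thread_prefetch=True)
print(baseline, coverage)
```

In this model only the first warp misses, since no earlier warp exists to prefetch for it. Whether such prefetches arrive in time, or evict useful data, is exactly what a throttling scheme like the one described in the abstract would have to judge, and the sketch does not model that.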