Boosting mobile GPU performance with a decoupled access/execute fragment processor

Authors:
José-María Arnau;Joan-Manuel Parcerisa;Polychronis Xekalakis
Affiliations:
Universitat Politècnica de Catalunya;Universitat Politècnica de Catalunya;Intel Barcelona Research Center, Intel Labs Barcelona
Venue:
Proceedings of the 39th Annual International Symposium on Computer Architecture
Year:
2012

Abstract

Smartphones represent one of the fastest growing markets, providing significant hardware/software improvements every few months. However, supporting these capabilities reduces the operating time per battery charge. The CPU/GPU component is left with only a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna. In this paper, we focus on improving the energy efficiency of the GPU, since graphical applications constitute an important part of the existing market. Moreover, the trend towards better screens will inevitably lead to a higher demand for improved graphics rendering. We show that the main bottleneck for these applications is the texture cache and that traditional techniques for hiding memory latency (prefetching, multithreading) either do not work well or come at a high energy cost. We thus propose migrating GPU designs towards the decoupled access/execute concept. Furthermore, we significantly reduce bandwidth usage in the decoupled architecture by exploiting inter-core data sharing. Using commercial Android applications, we show that the final design can achieve 93% of the performance of a heavily multithreaded GPU while providing energy savings of 34%.