Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency

  • Authors:
  • Syed Zohaib Gilani; Nam Sung Kim; Michael J. Schulte

  • Affiliations:
  • The University of Wisconsin–Madison; The University of Wisconsin–Madison; AMD Research, Advanced Micro Devices, Inc.

  • Venue:
  • Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
  • Year:
  • 2013

Abstract

Modern GPUs share limited hardware resources, such as register files, among a large number of concurrently executing threads. To share these resources efficiently, several buffering and collision-avoidance stages are inserted into the GPU pipeline, which increase the read-after-write (RAW) latencies of instructions. Because GPUs are often architected to hide RAW latencies through extensive multithreading, they typically do not employ power-hungry data-forwarding networks (DFNs). However, we observe that many GPGPU applications do not have enough active threads ready to issue instructions to hide these RAW latencies. In this paper, we first demonstrate that DFNs can considerably improve the performance of many compute-intensive GPGPU applications, and we then propose most recent result forwarding (MoRF) as a low-power alternative to the DFN. Second, for floating-point (FP) operations, we exploit a high-throughput fused multiply-add (HFMA) unit to further reduce both RAW latencies and the number of FMA units in the GPU without impacting instruction throughput. Together, MoRF and HFMA provide geometric mean performance improvements of 18% and 29% for integer/single-precision and double-precision GPGPU applications, respectively. Finally, both MoRF and HFMA allow the GPU to effectively mimic a shallower pipeline for a large fraction of instructions. Exploiting this benefit, we propose low-power pipelines that reduce peak power consumption by 14% without affecting performance or increasing the complexity of the forwarding network. The peak-power reduction allows GPUs to operate more cores within the same power budget, achieving a geometric mean performance improvement of 33% for double-precision GPGPU applications.
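
As a rough illustration of the problem the paper targets, the sketch below is a hypothetical CUDA microbenchmark (not code from the paper): it times a chain of dependent FMA instructions launched with a single resident warp. With so few threads, the warp scheduler cannot hide the RAW latency between back-to-back dependent FMAs, so elapsed time tracks the effective pipeline latency that a DFN, MoRF, or HFMA would shorten. The kernel name, seed constants, and iteration count are all illustrative assumptions.

```cuda
// Hypothetical microbenchmark: a chain of dependent FMAs exposes RAW latency
// when too few warps are resident to hide it. Not from the paper.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dependent_fma_chain(float *out, float a, float b, int iters)
{
    float x = threadIdx.x;          // per-thread seed
    for (int i = 0; i < iters; ++i)
        x = fmaf(a, x, b);          // each FMA reads the previous result (RAW)
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main()
{
    const int threads = 32;         // one warp per block: little latency hiding
    const int blocks  = 1;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dependent_fma_chain<<<blocks, threads>>>(d_out, 1.000001f, 0.5f, 1 << 20);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // With one resident warp, elapsed time scales with iters * effective RAW
    // latency; forwarding or more resident warps would shorten or hide it.
    printf("elapsed: %.3f ms\n", ms);

    cudaFree(d_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Rerunning the same kernel with many blocks per SM shows the latency-hiding effect the abstract describes: once enough independent warps are resident, throughput recovers even without forwarding, which is why forwarding pays off mainly when active thread counts are low.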