Power-efficient computing for compute-intensive GPGPU applications

Authors:
Syed Zohaib Gilani; Nam Sung Kim; Michael J. Schulte
Affiliations:
The University of Wisconsin-Madison, Madison, WI, USA; The University of Wisconsin-Madison, Madison, WI, USA; Advanced Micro Devices, Austin, TX, USA
Venue:
Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques
Year:
2012

Abstract

The peak performance of graphics processing units (GPUs) has traditionally been raised by adding compute resources and/or increasing their clock frequency. However, both approaches significantly increase GPU power consumption. Consequently, modern high-performance GPUs are power constrained, and future processors must rely on more power-efficient approaches to improve performance. In this paper, we propose three power-efficient techniques for improving GPU performance. First, we observe that many GPGPU applications are integer-instruction intensive. For such applications, we propose using the fused multiply-add (FMA) units to fuse dependent integer instructions into a composite instruction, which improves power efficiency and performance by reducing the number of instructions fetched and executed. Second, GPUs often perform computations that are duplicated across multiple threads. We dynamically detect such instructions and execute them in a separate scalar pipeline. Third, register file bandwidth in GPUs is a critical resource that is optimized for 32-bit instruction operands, yet many operands require considerably fewer bits for accurate representation and computation. We propose a sliced GPU architecture that improves performance by dual-issuing instructions to two 16-bit execution slices. Overall, our techniques improve power efficiency by more than 25% (geometric mean).
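
The three ideas in the abstract can be illustrated with a small software model. The sketch below is not the paper's hardware design: the instruction encoding, the choice of a multiply followed by an add as the fusable pair, and the function names `fuse_mul_add`, `is_uniform`, and `fits_16bit` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Toy instruction format: (opcode, destination, sources).
# This encoding is hypothetical and exists only for this sketch.
Instr = Tuple[str, str, Tuple[str, ...]]


def fuse_mul_add(program: List[Instr]) -> List[Instr]:
    """Fuse `mul t, a, b` directly followed by `add d, t, c` into `mad d, a, b, c`.

    Assumption: the temporary `t` is not read after the add. A real
    compiler or hardware pass would need a liveness check here.
    """
    fused: List[Instr] = []
    i = 0
    while i < len(program):
        op, dest, srcs = program[i]
        if op == "mul" and i + 1 < len(program):
            next_op, next_dest, next_srcs = program[i + 1]
            # The add must consume the multiply result exactly once.
            if next_op == "add" and next_srcs.count(dest) == 1:
                addend = next(s for s in next_srcs if s != dest)
                fused.append(("mad", next_dest, (srcs[0], srcs[1], addend)))
                i += 2  # two instructions were replaced by one
                continue
        fused.append(program[i])
        i += 1
    return fused


def is_uniform(operands_per_thread: List[Tuple[int, ...]]) -> bool:
    """True when every thread supplies identical source operands, so the
    instruction could be computed once instead of once per thread."""
    return all(ops == operands_per_thread[0] for ops in operands_per_thread)


def fits_16bit(value: int) -> bool:
    """True when a signed integer is representable in 16 bits, making it a
    candidate for a narrow execution slice."""
    return -(1 << 15) <= value < (1 << 15)


program = [
    ("mul", "t0", ("a", "b")),
    ("add", "d", ("t0", "c")),
    ("sub", "e", ("d", "a")),
]
fused_program = fuse_mul_add(program)
```

Here `fused_program` holds two instructions instead of three, which mirrors the reduction in fetched and executed instructions that the abstract describes. The other two helpers only model the detection conditions (identical operands across threads, operands narrow enough for 16 bits); how the hardware acts on those conditions is outside the scope of this sketch.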