GPUWattch: enabling energy optimizations in GPGPUs

Authors:
Jingwen Leng; Tayler Hetherington; Ahmed ElTantawy; Syed Gilani; Nam Sung Kim; Tor M. Aamodt; Vijay Janapa Reddi
Affiliations:
The University of Texas at Austin; University of British Columbia; University of British Columbia; University of Wisconsin-Madison; University of Wisconsin-Madison; University of British Columbia; The University of Texas at Austin
Venue:
Proceedings of the 40th Annual International Symposium on Computer Architecture
Year:
2013

Abstract

General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600, respectively). The model also accurately tracks the power consumption trend over time. We integrate the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings achievable with dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained streaming multiprocessor (SM) cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
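
To make the bottom-up idea in the abstract concrete, the following is a minimal sketch rather than the GPUWattch implementation. It assumes that each microarchitectural component contributes a fixed dynamic energy per access, that average dynamic power is the summed access energy divided by the sampling window, and that DVFS follows the textbook CMOS approximation in which dynamic power scales with V²·f. The component names, energy values, and function names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One microarchitectural block in a bottom-up power model (hypothetical)."""
    name: str
    energy_per_access_nj: float  # assumed dynamic energy per access, in nJ
    accesses: int                # activity count observed in the sampling window


def dynamic_power_w(components, window_s):
    """Average dynamic power: summed access energy divided by the window length."""
    total_energy_j = sum(
        c.accesses * c.energy_per_access_nj * 1e-9 for c in components
    )
    return total_energy_j / window_s


def scale_dynamic_power(power_w, v_ratio, f_ratio):
    """Textbook CMOS approximation: dynamic power scales with V^2 * f."""
    return power_w * v_ratio ** 2 * f_ratio


def relative_error(modeled_w, measured_w):
    """Absolute relative error of the model against a hardware measurement."""
    return abs(modeled_w - measured_w) / measured_w


# Illustrative inputs only; none of these numbers come from the paper.
components = [
    Component("alu", energy_per_access_nj=0.5, accesses=2_000_000),
    Component("register_file", energy_per_access_nj=0.25, accesses=4_000_000),
]
window_s = 1e-3          # 1 ms sampling window
static_power_w = 1.0     # assumed constant leakage term

dynamic_w = dynamic_power_w(components, window_s)
total_w = static_power_w + dynamic_w
scaled_dynamic_w = scale_dynamic_power(dynamic_w, v_ratio=0.9, f_ratio=0.8)

print(f"dynamic power: {dynamic_w:.3f} W, total: {total_w:.3f} W")
print(f"dynamic power after DVFS: {scaled_dynamic_w:.3f} W")
```

In a setup like the one the abstract describes, the activity counts would come from a cycle-level simulator such as GPGPU-Sim and the per-access energies from the configured component models. `relative_error` mirrors the kind of comparison against measured hardware power that the validation figures summarize.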