Inter-warp instruction temporal locality in deep-multithreaded GPUs

Authors:
Ahmad Lashgar; Amirali Baniasadi; Ahmad Khonsari
Affiliations:
School of Electrical and Computer Engineering, University College of Engineering, University of Tehran, Tehran, Iran; Electrical and Computer Engineering Department, University of Victoria, Victoria, British Columbia, Canada; School of Electrical and Computer Engineering, University College of Engineering, University of Tehran, Tehran, Iran, School of Computer Science, Institute for Research in Fundamental Sciences, Teh ...
Venue:
ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Year:
2013

Abstract

GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit locality in control flow, in instruction and data addresses, and in values. In this study we investigate inter-warp instruction temporal locality and show that, during short intervals, a significant share of fetched instructions are fetched unnecessarily. This observation provides several opportunities to enhance GPUs. We discuss different possibilities and evaluate a filter cache as a case study. Moreover, we investigate how variations in microarchitectural parameters impact the potential benefits of a filter cache in GPUs.
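
The idea of inter-warp instruction temporal locality can be illustrated with a minimal Python sketch. It is not the authors' methodology, which this record does not describe. It assumes a fetch trace of hypothetical `(cycle, warp_id, pc)` tuples, treats a fetch as redundant when a different warp fetched the same address within the preceding `window` cycles, and models the filter cache as a small fully associative LRU buffer of instruction addresses. The function names, the trace format, and the synthetic trace are all assumptions made for illustration.

```python
from collections import OrderedDict


def redundant_fetch_share(trace, window):
    """Fraction of fetches whose pc was fetched by a *different* warp
    within the preceding `window` cycles (an assumed definition)."""
    last_seen = {}  # pc -> {warp_id: cycle of that warp's latest fetch}
    redundant = 0
    for cycle, warp, pc in trace:
        by_warp = last_seen.setdefault(pc, {})
        if any(w != warp and cycle - c <= window for w, c in by_warp.items()):
            redundant += 1
        by_warp[warp] = cycle
    return redundant / len(trace) if trace else 0.0


def filter_cache_hit_rate(trace, entries):
    """Hit rate of a hypothetical fully associative LRU filter cache
    holding `entries` instruction addresses."""
    cache = OrderedDict()
    hits = 0
    for _, _, pc in trace:
        if pc in cache:
            hits += 1
            cache.move_to_end(pc)  # mark as most recently used
        else:
            cache[pc] = True
            if len(cache) > entries:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace) if trace else 0.0


# Synthetic trace (illustrative only): 4 warps step through the same
# 8 instructions, one fetch per cycle, in round-robin order.
trace = []
cycle = 0
for pc in range(0, 32, 4):
    for warp in range(4):
        trace.append((cycle, warp, pc))
        cycle += 1

print(redundant_fetch_share(trace, window=4))
print(filter_cache_hit_rate(trace, entries=2))
```

In this synthetic trace every instruction is fetched once by each of the four warps in quick succession, so three of every four fetches repeat an address another warp has just fetched, and even a two-entry buffer captures them. Real workloads with divergent control flow or different scheduling would behave differently.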