Stage-skip pipeline: a low power processor architecture using a decoded instruction buffer
ISLPED '96 Proceedings of the 1996 international symposium on Low power electronics and design
The filter cache: an energy efficient memory structure
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Power Consumption of GPUs from a Software Perspective
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
An integrated GPU power and performance model
Proceedings of the 37th annual international symposium on Computer architecture
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Dynamic detection of uniform and affine vectors in GPGPU computations
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Energy-efficient mechanisms for managing thread context in throughput processors
Proceedings of the 38th annual international symposium on Computer architecture
Performance and Power Analysis of ATI GPU: A Statistical Approach
NAS '11 Proceedings of the 2011 IEEE Sixth International Conference on Networking, Architecture, and Storage
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit localities in control-flow, instruction and data addresses and values. In this study we investigate inter-warp instruction temporal locality and show that during short intervals a significant share of fetched instructions are fetched unnecessarily. This observation provides several opportunities to enhance GPUs. We discuss different possibilities and evaluate filter cache as a case study. Moreover, we investigate how variations in microarchitectural parameters impacts potential filter cache benefits in GPUs.