Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

Authors:
Jayesh Gaur;Raghuram Srinivasan;Sreenivas Subramoney;Mainak Chaudhuri
Affiliations:
Intel Architecture Group, Bangalore, India;The Ohio State University, Columbus, OH;Intel Architecture Group, Bangalore, India;Indian Institute of Technology, Kanpur, India
Venue:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2013

Citing 36
Cited 0

Hierarchical Z-buffer visibility

SIGGRAPH '93 Proceedings of the 20th annual conference on Computer graphics and interactive techniques
FBRAM: a new form of memory optimized for 3D graphics

SIGGRAPH '94 Proceedings of the 21st annual conference on Computer graphics and interactive techniques
Talisman: commodity realtime 3D graphics for the PC

SIGGRAPH '96 Proceedings of the 23rd annual conference on Computer graphics and interactive techniques
Realizing OpenGL: two implementations of one architecture

HWWS '97 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Accommodating memory latency in a low-cost rasterizer

HWWS '97 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Rendering complex scenes with memory-coherent ray tracing

Proceedings of the 24th annual conference on Computer graphics and interactive techniques
The design and analysis of a cache architecture for texture mapping

Proceedings of the 24th annual international symposium on Computer architecture
Evaluation of high performance multicache parallel texture mapping

ICS '98 Proceedings of the 12th international conference on Supercomputing
Multi-level texture caching for 3D graphics hardware

Proceedings of the 25th annual international symposium on Computer architecture
Prefetching in a texture cache architecture

HWWS '98 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Parallel texture caching

HWWS '99 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Dead-block prediction & dead-block correlating prefetchers

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Timekeeping in the memory system: predicting and optimizing memory behavior

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Texram: A Smart Memory for Texturing

IEEE Computer Graphics and Applications
Pyramidal parametrics

SIGGRAPH '83 Proceedings of the 10th annual conference on Computer graphics and interactive techniques
A subdivision algorithm for computer display of curved surfaces.

A subdivision algorithm for computer display of curved surfaces.
Triangle order optimization for graphics hardware computation culling

I3D '06 Proceedings of the 2006 symposium on Interactive 3D graphics and games
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive insertion policies for high performance caching

Proceedings of the 34th annual international symposium on Computer architecture
Counter-Based Cache Replacement and Bypassing Algorithms

IEEE Transactions on Computers
Adaptive insertion policies for managing shared caches

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Proceedings of the 36th annual international symposium on Computer architecture
A study of replacement algorithms for a virtual-storage computer

IBM Systems Journal
Evaluation techniques for storage hierarchies

IBM Systems Journal
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
High performance cache replacement using re-reference interval prediction (RRIP)

Proceedings of the 37th annual international symposium on Computer architecture
Using dead blocks as a virtual victim cache

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Fermi GF100 GPU Architecture

IEEE Micro
Vantage: scalable and efficient fine-grain cache partitioning

Proceedings of the 38th annual international symposium on Computer architecture
Bypass and insertion algorithms for exclusive last-level caches

Proceedings of the 38th annual international symposium on Computer architecture
SHiP: signature-based hit predictor for high performance caching

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture

HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Texture Caches

IEEE Micro
Probabilistic shared cache management (PriSM)

Proceedings of the 39th Annual International Symposium on Computer Architecture
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches

Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

Three-dimensional (3D) scene rendering is implemented in the form of a pipeline in graphics processing units (GPUs). In different stages of the pipeline, different types of data get accessed. These include, for instance, vertex, depth, stencil, render target (same as pixel color), and texture sampler data. The GPUs traditionally include small caches for vertex, render target, depth, and stencil data as well as multi-level caches for the texture sampler units. Recent introduction of reasonably large last-level caches (LLCs) shared among these data streams in discrete as well as integrated graphics hardware architectures has opened up new opportunities for improving 3D rendering. The GPUs equipped with such large LLCs can enjoy far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams. In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC) that dynamically learns the reuse probabilities and accordingly manages the LLC of the GPU. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% LLC misses across 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in the LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC further improves to 11.8% compared to DRRIP.