Hierarchical Z-buffer visibility
SIGGRAPH '93 Proceedings of the 20th annual conference on Computer graphics and interactive techniques
FBRAM: a new form of memory optimized for 3D graphics
SIGGRAPH '94 Proceedings of the 21st annual conference on Computer graphics and interactive techniques
Talisman: commodity realtime 3D graphics for the PC
SIGGRAPH '96 Proceedings of the 23rd annual conference on Computer graphics and interactive techniques
Realizing OpenGL: two implementations of one architecture
HWWS '97 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Accommodating memory latency in a low-cost rasterizer
HWWS '97 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Rendering complex scenes with memory-coherent ray tracing
Proceedings of the 24th annual conference on Computer graphics and interactive techniques
The design and analysis of a cache architecture for texture mapping
Proceedings of the 24th annual international symposium on Computer architecture
Evaluation of high performance multicache parallel texture mapping
ICS '98 Proceedings of the 12th international conference on Supercomputing
Multi-level texture caching for 3D graphics hardware
Proceedings of the 25th annual international symposium on Computer architecture
Prefetching in a texture cache architecture
HWWS '98 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
HWWS '99 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
Dead-block prediction & dead-block correlating prefetchers
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Timekeeping in the memory system: predicting and optimizing memory behavior
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Texram: A Smart Memory for Texturing
IEEE Computer Graphics and Applications
SIGGRAPH '83 Proceedings of the 10th annual conference on Computer graphics and interactive techniques
A subdivision algorithm for computer display of curved surfaces.
A subdivision algorithm for computer display of curved surfaces.
Triangle order optimization for graphics hardware computation culling
I3D '06 Proceedings of the 2006 symposium on Interactive 3D graphics and games
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive insertion policies for high performance caching
Proceedings of the 34th annual international symposium on Computer architecture
Counter-Based Cache Replacement and Bypassing Algorithms
IEEE Transactions on Computers
Adaptive insertion policies for managing shared caches
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches
Proceedings of the 36th annual international symposium on Computer architecture
A study of replacement algorithms for a virtual-storage computer
IBM Systems Journal
Evaluation techniques for storage hierarchies
IBM Systems Journal
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
High performance cache replacement using re-reference interval prediction (RRIP)
Proceedings of the 37th annual international symposium on Computer architecture
Using dead blocks as a virtual victim cache
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
IEEE Micro
Vantage: scalable and efficient fine-grain cache partitioning
Proceedings of the 38th annual international symposium on Computer architecture
Bypass and insertion algorithms for exclusive last-level caches
Proceedings of the 38th annual international symposium on Computer architecture
SHiP: signature-based hit predictor for high performance caching
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
IEEE Micro
Probabilistic shared cache management (PriSM)
Proceedings of the 39th Annual International Symposium on Computer Architecture
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
Three-dimensional (3D) scene rendering is implemented in the form of a pipeline in graphics processing units (GPUs). In different stages of the pipeline, different types of data get accessed. These include, for instance, vertex, depth, stencil, render target (same as pixel color), and texture sampler data. The GPUs traditionally include small caches for vertex, render target, depth, and stencil data as well as multi-level caches for the texture sampler units. Recent introduction of reasonably large last-level caches (LLCs) shared among these data streams in discrete as well as integrated graphics hardware architectures has opened up new opportunities for improving 3D rendering. The GPUs equipped with such large LLCs can enjoy far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams. In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC) that dynamically learns the reuse probabilities and accordingly manages the LLC of the GPU. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% LLC misses across 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in the LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC further improves to 11.8% compared to DRRIP.