Revisiting Cache Block Superloading

Authors:
Matthew A. Watkins; Sally A. McKee; Lambert Schaelicke
Affiliations:
School of Electrical and Computer Engineering, Cornell University; Department of Computer Science and Engineering, Chalmers University of Technology; Fort Collins Design Center, Intel Corporation
Venue:
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Year:
2008

Abstract

Technological advances and increasingly complex, dynamic application behavior argue for revisiting mechanisms that adapt logical cache block size to application characteristics. This approach to bridging the processor/memory performance gap has been studied before, but mostly via trace-driven simulation and only for L1 caches. Given changes in hardware and software technology, we revisit the general approach: we propose a transparent, phase-adaptive, low-complexity mechanism for L2 superloading and evaluate it on a full-system simulator for 23 SPEC CPU2000 codes. Targeting L2 benefits both instruction and data fetches. We investigate cache block sizes of 32-512B and confirm that no fixed size performs well for all applications: the gap between the best and worst fixed block sizes ranges from 5% to 49%. Our scheme obtains performance similar to that of the per-application best static block size. In a few cases, we minimally decrease performance compared to the best static size, but the best size varies per application and rarely matches that of real hardware. We generally improve performance over the best static choices by up to 10%. Phase adaptability particularly benefits multiprogrammed workloads with conflicting locality characteristics, yielding performance gains of 5-20%. Our approach also outperforms next-line and delta prefetching.
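
The claim that no fixed block size suits every access pattern can be illustrated with a toy model. The sketch below is not the authors' mechanism or simulator: it assumes a small fully associative LRU cache of 32B physical blocks, and the names `count_misses`, `superload`, `sequential`, and `sparse` are hypothetical. On a miss, it loads every physical block of the aligned logical block, which is one simple reading of superloading.

```python
from collections import OrderedDict


def count_misses(trace, superload, phys_block=32, capacity_blocks=64):
    """Count misses in a toy fully associative LRU cache (an assumed model).

    On a miss, all `superload` physical blocks that make up the aligned
    logical block are loaded, not just the one that was requested.
    """
    cache = OrderedDict()  # physical block number -> None, oldest first
    misses = 0
    for addr in trace:
        block = addr // phys_block
        if block in cache:
            cache.move_to_end(block)  # refresh LRU position on a hit
            continue
        misses += 1
        base = block - block % superload  # align to the logical block
        for b in range(base, base + superload):
            cache[b] = None
            cache.move_to_end(b)
        cache.move_to_end(block)  # the demanded block is most recent
        while len(cache) > capacity_blocks:
            cache.popitem(last=False)  # evict least recently used
    return misses


# Dense streaming pattern: every 4-byte word of a 16 KiB region, once.
sequential = list(range(0, 16384, 4))
# Sparse pattern: 64 words spaced 512B apart, swept ten times.
sparse = [i * 512 for i in range(64)] * 10

sizes = (1, 4, 16)  # logical block = 32B, 128B, 512B
seq_misses = {s: count_misses(sequential, s) for s in sizes}
sparse_misses = {s: count_misses(sparse, s) for s in sizes}
print(seq_misses)     # {1: 512, 4: 128, 16: 32}
print(sparse_misses)  # {1: 64, 4: 640, 16: 640}
```

In this toy setting the streaming trace favors the largest logical block, while the sparse trace favors the smallest because the extra blocks displace data that would otherwise be reused. A scheme that picks the size per program phase could in principle track the better choice in each case, which is the intuition behind a phase-adaptive approach.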