Location-aware cache management for many-core processors with deep cache hierarchy

Authors:
Jongsoo Park;Richard M. Yoo;Daya S. Khudia;Christopher J. Hughes;Daehyun Kim
Affiliations:
Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;University of Michigan - Ann Arbor;Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Citing 32
Cited 0

Distributing Hot-Spot Addressing in Large-Scale Multiprocessors

IEEE Transactions on Computers
Efficient synchronization primitives for large-scale cache-coherent multiprocessors

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Delayed consistency and its effects on the miss rate of parallel programs

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
The Stanford Dash Multiprocessor

Computer
The KSR1: experimentation and modeling of poststore

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Some efficient solutions to the affine scheduling problem: I. One-dimensional time

International Journal of Parallel Programming
Data Forwarding in Scalable Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Scalable queue-based spin locks with timeout

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Using the Compiler to Improve Cache Replacement Decisions

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Orion: a power-performance simulator for interconnection networks

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Data cache locking for higher program predictability

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Predicting whole-program locality through reuse distance analysis

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications

EDTC '97 Proceedings of the 1997 European conference on Design and Test
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Synergistic Processing in Cell's Multicore Architecture

IEEE Micro
Compilers: Principles, Techniques, and Tools (2nd Edition)

Compilers: Principles, Techniques, and Tools (2nd Edition)
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Comparing memory systems for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Adaptive insertion policies for high performance caching

Proceedings of the 34th annual international symposium on Computer architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
x86-TSO: a rigorous and usable programmer's model for x86 multiprocessors

Communications of the ACM
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
High performance cache replacement using re-reference interval prediction (RRIP)

Proceedings of the 37th annual international symposium on Computer architecture
Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors

Proceedings of the 37th annual international symposium on Computer architecture
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Efficient techniques for predicting cache sharing and throughput

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Studying multicore processor scaling via reuse distance analysis

Proceedings of the 40th Annual International Symposium on Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

As cache hierarchies become deeper and the number of cores on a chip increases, managing caches becomes more important for performance and energy. However, current hardware cache management policies do not always adapt optimally to the applications behavior: e.g., caches may be polluted by data structures whose locality cannot be captured by the caches, and producer-consumer communication incurs multiple round trips of coherence messages per cache line transferred. We propose load and store instructions that carry hints regarding into which cache(s) the accessed data should be placed. Our instructions allow software to convey locality information to the hardware, while incurring minimal hardware cost and not affecting correctness. Our instructions provide a 1.07x speedup and a 1.24x energy efficiency boost, on average, according to simulations on a 64-core system with private L1 and L2 caches. With a large shared L3 cache added, the benefits increase, providing 1.33x energy reduction on average.