Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last-level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC. First, current memory allocators align stack bases on page boundaries, which becomes a source of severe conflict misses when data-parallel applications run with large numbers of threads. Second, correctness does not require private data to reside in the shared directory or the LLC. This paper advocates stack-base randomization, which eliminates the major source of conflict misses for large numbers of threads. This alone is not sufficient, however, when capacity becomes a limitation for the directory or the LLC. We therefore propose a non-inclusive, semi-coherent cache organization (NISC) that removes the inclusion requirement for private data and thereby reduces capacity misses. Our data-parallel benchmarks show that these limitations prevent scaling beyond 8 cores, while our techniques allow most benchmarks to scale to at least 32 cores. At 8 cores, stack randomization provides a mean speedup of 1.2X; at 32 cores it provides a speedup of 2.7X over the best baseline configuration. Compared to a conventional design with a 2 MB LLC, our technique achieves similar performance with only a 256 KB LLC, suggesting that LLCs may typically be overprovisioned. When LLC resources are very limited, NISC can further improve system performance by 1.8X.
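To make the first technique concrete, the sketch below shows one plausible realization of stack-base randomization in C under Linux/pthreads. It is illustrative only, not the paper's implementation: the names launch_worker() and worker(), the 64-byte line size, and the one-page skew range are assumptions, and the alignment requirements of pthread_attr_setstack() are implementation-defined.

#define _GNU_SOURCE          /* for MAP_ANONYMOUS / MAP_STACK on Linux */
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define STACK_SIZE (1 << 20)            /* 1 MB usable stack per thread */
#define LINE_SIZE  64                   /* assumed cache line size */
#define MAX_SKEW   (64 * LINE_SIZE)     /* randomize over one 4 KB page */

static void *worker(void *arg) {
    /* Hypothetical thread body: its stack-resident locals now map to
       different cache sets in each thread instead of colliding. */
    return NULL;
}

/* Hypothetical helper: start one worker on a stack whose base is offset
   by a random, cache-line-aligned skew instead of being page-aligned. */
static int launch_worker(pthread_t *tid) {
    size_t total = STACK_SIZE + MAX_SKEW;   /* reserve room for the skew */
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Random multiple of the line size, uniform over one page. */
    size_t skew = ((size_t)rand() % (MAX_SKEW / LINE_SIZE)) * LINE_SIZE;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, (char *)base + skew, STACK_SIZE);

    int rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    /* In real use, seed rand() once per run and munmap(base, total)
       after joining the thread; omitted here for brevity. */
    return rc;
}

Skewing within a single page leaves the per-thread stack size and TLB behavior unchanged while spreading each thread's stack frames across different cache sets, which is enough to break the pathological page-boundary alignment the abstract describes.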