Optimal Partitioning of Cache Memory
IEEE Transactions on Computers
SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance
WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Predicting whole-program locality through reuse distance analysis
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Proceedings of the 30th annual international symposium on Computer architecture
Miss Rate Prediction across All Program Inputs
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Comparing Program Phase Detection Techniques
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Dynamic Partitioning of Shared Cache Memory
The Journal of Supercomputing
Array regrouping and structure splitting using whole-program reference affinity
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Cross-architecture performance predictions for scientific applications using parameterized models
Proceedings of the joint international conference on Measurement and modeling of computer systems
CQoS: a framework for enabling QoS in shared caches of CMP platforms
Proceedings of the 18th annual international conference on Supercomputing
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Reuse-distance-based miss-rate prediction on a per instruction basis
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Miss Rate Prediction Across Program Inputs and Cache Configurations
IEEE Transactions on Computers
Cooperative cache partitioning for chip multiprocessors
Proceedings of the 21st annual international conference on Supercomputing
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Adaptive insertion policies for managing shared caches
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Computer
Pinpointing and Exploiting Opportunities for Enhancing Data Reuse
ISPASS '08 Proceedings of the ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software
Program locality analysis using reuse distance
ACM Transactions on Programming Languages and Systems (TOPLAS)
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches
Proceedings of the 36th annual international symposium on Computer architecture
Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Evaluation techniques for storage hierarchies
IBM Systems Journal
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Cache Partitioning on Chip Multi-Processors for Balanced Parallel Scientific Applications
PDCAT '09 Proceedings of the 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
Accelerating multicore reuse distance analysis with sampling and parallelization
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Quality of service shared cache management in chip multiprocessor architecture
ACM Transactions on Architecture and Code Optimization (TACO)
Dynamic program phase detection in distributed shared- memory multiprocessors
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Detecting phases in parallel applications on shared memory architectures
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Is reuse distance applicable to data locality analysis on chip multiprocessors?
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Path-Based reuse distance analysis
CC'06 Proceedings of the 15th international conference on Compiler Construction
Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
ACM Transactions on Computer Systems (TOCS)
Hi-index | 0.00 |
This paper investigates partitioning the ways of a shared last-level cache among the threads of a symmetric data-parallel application running on a chip-multiprocessor. Unlike prior work on way-partitioning for unrelated threads in a multiprogramming workload, the domain of multithreaded programs requires both throughput and fairness. Additionally, our workloads show no obvious thread differences to exploit: program threads see nearly identical IPC and data reuse as they progress (as expected for a well-written load-balanced data-parallel program). Despite the balance and symmetry among threads, this paper shows that a balanced partitioning of cache ways between threads is suboptimal. Instead, this paper proposes a strategy of temporarily imbalancing the partitions between different threads to improve cache utilization by adapting to the locality behavior of the threads as captured by dynamic set-specific reuse-distance (SSRD). Cumulative SSRD histograms have knees that correspond to different important working sets; thus, cache ways can be taken away from a thread with only minimal performance impact if that thread is currently operating far from a knee. Those ways can then be given to a single "preferred" thread to push it over the next knee. The preferred thread is chosen in a round-robin fashion to ensure balanced progress over the execution. The algorithm also effectively handles scenarios where an unpartitioned cache might outperform any sort of explicit partitioning. This dynamic partition imbalance algorithm allows up to 44% reduction in execution time and 91% reduction in misses over an unpartitioned shared cache for 9 benchmarks from the PARSEC-2.0 and SPEC OMP suites.