Imbalanced cache partitioning for balanced data-parallel programs

Authors:
Abhisek Pan;Vijay S. Pai
Affiliations:
Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN
Venue:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2013

Abstract

This paper investigates partitioning the ways of a shared last-level cache among the threads of a symmetric data-parallel application running on a chip multiprocessor. Unlike prior work on way-partitioning for unrelated threads in a multiprogramming workload, the domain of multithreaded programs requires both throughput and fairness. Additionally, our workloads show no obvious thread differences to exploit: program threads see nearly identical IPC and data reuse as they progress, as expected for a well-written, load-balanced data-parallel program. Despite the balance and symmetry among threads, this paper shows that a balanced partitioning of cache ways among threads is suboptimal. Instead, this paper proposes a strategy of temporarily imbalancing the partitions among threads to improve cache utilization by adapting to the locality behavior of the threads as captured by dynamic set-specific reuse distance (SSRD). Cumulative SSRD histograms have knees that correspond to different important working sets; thus, cache ways can be taken away from a thread with only minimal performance impact if that thread is currently operating far from a knee. Those ways can then be given to a single "preferred" thread to push it over the next knee. The preferred thread is chosen in a round-robin fashion to ensure balanced progress over the execution. The algorithm also effectively handles scenarios where an unpartitioned cache might outperform any sort of explicit partitioning. This dynamic partition imbalance algorithm achieves up to a 44% reduction in execution time and a 91% reduction in misses over an unpartitioned shared cache for 9 benchmarks from the PARSEC-2.0 and SPEC OMP suites.
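The knee-based imbalance idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given here. It is a simplified illustration under stated assumptions: a "knee" is taken to be any way count whose marginal hit gain meets a hypothetical threshold `min_gain`, donors shrink to their nearest knee at or below the equal share, and the function names (`cumulative_hits`, `find_knees`, `partition`) are invented for this example.

```python
from typing import List


def cumulative_hits(hist: List[int]) -> List[int]:
    """cum[w] = hits with w ways, where hist[d] counts accesses whose
    set-specific reuse distance is d (an access hits if d < w)."""
    cum = [0]
    for count in hist:
        cum.append(cum[-1] + count)
    return cum


def find_knees(cum: List[int], min_gain: int) -> List[int]:
    """Assumed knee definition: way counts whose marginal gain >= min_gain."""
    return [w for w in range(1, len(cum)) if cum[w] - cum[w - 1] >= min_gain]


def partition(hists: List[List[int]], total_ways: int,
              epoch: int, min_gain: int) -> List[int]:
    """Return a per-thread way allocation for one epoch."""
    n = len(hists)
    share = total_ways // n
    preferred = epoch % n                 # round-robin preferred thread
    alloc = [share] * n
    spare = total_ways - share * n        # leftover ways from integer division

    # Each non-preferred thread shrinks to its nearest knee at or below the
    # equal share (at least one way), donating the ways in between.
    for t in range(n):
        if t == preferred:
            continue
        knees = find_knees(cumulative_hits(hists[t]), min_gain)
        floor = max([k for k in knees if k <= share], default=1)
        spare += share - floor
        alloc[t] = floor

    # Imbalance only if the donated ways push the preferred thread over
    # its next knee; otherwise fall back to a balanced split.
    pref_knees = find_knees(cumulative_hits(hists[preferred]), min_gain)
    next_knees = [k for k in pref_knees if k > share]
    if next_knees and share + spare >= next_knees[0]:
        alloc[preferred] = share + spare
        return alloc

    balanced = [share] * n
    balanced[preferred] += total_ways - share * n
    return balanced


if __name__ == "__main__":
    # Four symmetric threads, 16 ways, knees at 2 and 8 ways.
    hist = [1] * 16
    hist[1] = 100
    hist[7] = 100
    for epoch in range(4):
        print(epoch, partition([hist] * 4, 16, epoch, min_gain=50))
```

In this toy setup every thread has the same histogram, yet the equal share of 4 ways sits between the knees at 2 and 8, so each donor can drop to 2 ways and the preferred thread receives enough ways to pass its knee at 8. Rotating `epoch` rotates the preferred thread, which is the round-robin fairness step the abstract mentions. The fallback to a balanced split when no knee is reachable is a design choice of this sketch only.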