Performance degradation of memory-intensive programs, caused by the LRU policy's inability to handle weak-locality data accesses in the last-level cache, is increasingly serious for two reasons. First, the last-level cache remains on the CPU's critical path, where only simple management mechanisms such as LRU can be used, precluding the sophisticated hardware mechanisms that might address the problem. Second, the shared-cache structure common in multi-core processors has made this critical path even more performance-sensitive, owing to intensive inter-thread contention for shared cache resources. Researchers have recently tried to address the shortcomings of the LRU policy by partitioning the cache with hardware or OS facilities guided by run-time locality information, but such approaches often rely on special hardware support or lack sufficient accuracy. In contrast, for a large class of programs, locality information can be accurately predicted at the data-object level if access patterns are recognized through small training runs. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse-distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among them with a heuristic algorithm, reducing cache misses by segregating contending objects.
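As a rough illustration of the first profiling step, the sketch below computes per-object reuse-distance histograms from a memory trace, where the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address. This is a minimal O(n²) LRU-stack formulation for clarity, not the paper's sampling-based implementation; the trace format (object-id, address pairs) is an assumption for illustration.

```python
from collections import defaultdict

def reuse_distance_histograms(trace):
    """Per-object reuse-distance histograms from a trace of
    (object_id, address) pairs. A cold (first) access has
    infinite reuse distance."""
    stack = []  # LRU stack of addresses, most recently used at the end
    hist = defaultdict(lambda: defaultdict(int))  # obj -> distance -> count
    for obj, addr in trace:
        if addr in stack:
            # Distinct addresses touched since the last access to addr.
            dist = len(stack) - 1 - stack.index(addr)
            stack.remove(addr)
        else:
            dist = float("inf")  # cold miss
        stack.append(addr)
        hist[obj][dist] += 1
    return hist
```

An object whose histogram mass sits at small distances has strong locality and benefits from cache space; one dominated by large or infinite distances is weak-locality data that pollutes the cache under LRU, which is what the partitioning step segregates.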
The object-level cache partitioning framework has been implemented in a modified Linux kernel and tested on a commodity multi-core processor. Experimental results show that, in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, the NAS benchmarks, and a computational-kernel set.
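OS-level cache partitioning of this kind is commonly realized through page coloring: the low bits of a page's physical frame number determine which group of cache sets (its "color") the page maps to, so a modified kernel can confine an object's pages to an assigned subset of colors. A minimal sketch of the color computation, assuming 4 KiB pages and a physically indexed L2 (the constants are illustrative, not the paper's configuration):

```python
PAGE_SHIFT = 12  # 4 KiB pages (assumed)

def cache_color(phys_addr, num_colors):
    """Cache color of a physical address: low bits of the page
    frame number select which slice of L2 sets the page maps to."""
    return (phys_addr >> PAGE_SHIFT) % num_colors
```

By allocating pages only from the colors assigned to an object, the kernel bounds the fraction of the shared L2 that object can occupy, enforcing the partition decided by the heuristic algorithm without any special hardware.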