Performance degradation of memory-intensive programs, caused by the LRU policy's inability to handle weak-locality data accesses in the last-level cache, is increasingly serious for two reasons. First, the last-level cache remains on the CPU's critical path, where only simple management mechanisms such as LRU can be used, precluding the sophisticated hardware mechanisms that might address the problem. Second, the shared-cache structure common in multi-core processors has made this critical path even more performance-sensitive, owing to intensive inter-thread contention for shared cache resources. Researchers have recently tried to address the shortcomings of the LRU policy by partitioning the cache with hardware or OS facilities guided by run-time locality information, but such approaches often rely on special hardware support or lack sufficient accuracy. In contrast, for a large class of programs, locality information can be accurately predicted at the data-object level if access patterns are recognized through small training runs. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse-distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among them with a heuristic algorithm, reducing cache misses by segregating contending objects.
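As a rough illustration of the first profiling step, the sketch below computes per-object reuse-distance histograms from a memory trace, where the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address. This is a minimal O(n²) LRU-stack formulation for clarity, not the paper's sampling-based implementation; the trace format (object-id, address pairs) is an assumption for illustration.

```python
from collections import defaultdict

def reuse_distance_histograms(trace):
    """Per-object reuse-distance histograms from a trace of
    (object_id, address) pairs. A cold (first) access has
    infinite reuse distance."""
    stack = []  # LRU stack of addresses, most recently used at the end
    hist = defaultdict(lambda: defaultdict(int))  # obj -> distance -> count
    for obj, addr in trace:
        if addr in stack:
            # Distinct addresses touched since the last access to addr.
            dist = len(stack) - 1 - stack.index(addr)
            stack.remove(addr)
        else:
            dist = float("inf")  # cold miss
        stack.append(addr)
        hist[obj][dist] += 1
    return hist
```

An object whose histogram mass sits at small distances has strong locality and benefits from cache space; one dominated by large or infinite distances is weak-locality data that pollutes the cache under LRU, which is what the partitioning step segregates.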
The object-level cache partitioning framework has been implemented in a modified Linux kernel and tested on a commodity multi-core processor. Experimental results show that, in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, the NAS benchmarks, and a computational-kernel set.
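OS-level cache partitioning of this kind is commonly realized through page coloring: the low bits of a page's physical frame number determine which group of cache sets (its "color") the page maps to, so a modified kernel can confine an object's pages to an assigned subset of colors. A minimal sketch of the color computation, assuming 4 KiB pages and a physically indexed L2 (the constants are illustrative, not the paper's configuration):

```python
PAGE_SHIFT = 12  # 4 KiB pages (assumed)

def cache_color(phys_addr, num_colors):
    """Cache color of a physical address: low bits of the page
    frame number select which slice of L2 sets the page maps to."""
    return (phys_addr >> PAGE_SHIFT) % num_colors
```

By allocating pages only from the colors assigned to an object, the kernel bounds the fraction of the shared L2 that object can occupy, enforcing the partition decided by the heuristic algorithm without any special hardware.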