Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses: some misses occur in isolation, while others occur in parallel with other misses. Isolated misses hurt performance more than parallel misses, because the latency of a parallel miss overlaps with that of other misses. However, traditional cache replacement is not aware of this MLP-dependent cost differential between misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement that uses a runtime technique to compute the MLP-based cost of each cache miss, and it describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
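The abstract does not spell out how the MLP-based cost is computed or how it is combined with recency, so the following Python sketch is only an illustration under stated assumptions. It assumes that, in every cycle, each outstanding miss is charged 1/N cycles, where N is the number of misses outstanding in that cycle, and that the victim is the block with the lowest sum of recency position and weighted cost. The names `OutstandingMiss`, `mlp_costs`, `choose_victim`, and the `weight` parameter are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class OutstandingMiss:
    """One in-flight long-latency miss (hypothetical stand-in for a miss buffer entry)."""
    block: int         # block address or identifier
    start: int         # first cycle in which the miss is outstanding
    end: int           # first cycle after the miss has been serviced
    cost: float = 0.0  # accumulated MLP-based cost, in cycles


def mlp_costs(misses, horizon):
    """Accumulate an MLP-based cost for each miss, cycle by cycle.

    Assumption (not stated in the abstract): in each cycle, every
    outstanding miss is charged 1/N cycles, where N is the number of
    misses outstanding in that cycle. An isolated miss therefore pays
    its full latency, while overlapping misses share the latency.
    """
    for cycle in range(horizon):
        active = [m for m in misses if m.start <= cycle < m.end]
        for m in active:
            m.cost += 1.0 / len(active)
    return {m.block: m.cost for m in misses}


def choose_victim(recency, cost, weight=1.0):
    """Pick the block to evict from one cache set.

    recency: block -> position in the recency stack (0 = least recently used)
    cost:    block -> MLP-based cost recorded for the block's last miss
    weight:  hypothetical knob trading recency against cost;
             weight=0 degenerates to plain LRU
    """
    return min(recency, key=lambda b: recency[b] + weight * cost[b])
```

For example, with one isolated 100-cycle miss and two 100-cycle misses that overlap completely, `mlp_costs` charges the isolated miss 100 cycles and each parallel miss 50 cycles. Given those costs, `choose_victim` with `weight=0` evicts the least recently used block even if it is the costly isolated one, whereas a small positive weight steers the eviction toward a cheaper, more recently used block.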