Memory access buffering in multiprocessors
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
A methodology for implementing highly concurrent data objects
ACM Transactions on Programming Languages and Systems (TOPLAS)
Cache write policies and performance
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
An analysis of database workload performance on simultaneous multithreaded processors
Proceedings of the 25th annual international symposium on Computer architecture
Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Write buffer design for cache-coherent shared-memory multiprocessors
ICCD '95 Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Impact of Reducing Miss Write Latencies in Multiprocessors with Two Level Cache
EUROMICRO '98 Proceedings of the 24th Conference on EUROMICRO - Volume 1
Scaling and Charact rizing Database Workloads: Bridging the Gap between Research and Practice
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Effective stream-based and execution-based data prefetching
Proceedings of the 18th annual international conference on Supercomputing
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Chip Multithreading: Opportunities and Challenges
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence
Proceedings of the 32nd annual international symposium on Computer Architecture
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Proceedings of the 32nd annual international symposium on Computer Architecture
Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
High-Performance Throughput Computing
IEEE Micro
Request Reordering to Enhance the Performance of Strict Consistency Models
IEEE Computer Architecture Letters
Issues in the design of store buffers in dynamically scheduled processors
ISPASS '00 Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software
An analysis of the effects of miss clustering on the cost of a cache miss
Proceedings of the 4th international conference on Computing frontiers
Making the fast case common and the uncommon case simple in unbounded transactional memory
Proceedings of the 34th annual international symposium on Computer architecture
Hardware atomicity for reliable software speculation
Proceedings of the 34th annual international symposium on Computer architecture
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Proceedings of the 2007 workshop on Experimental computer science
ecs'07 Experimental computer science on Experimental computer science
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture
Hi-index | 0.00 |
This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps are then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they are unable to fully mitigate the performance impact of off-chip store misses and they also leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the Store Miss Accelerator, an optimization of Hardware Scout and a new application of Speculative Lock Elision, are demonstrated to virtually eliminate the impact of off-chip store misses.