Store Memory-Level Parallelism Optimizations for Commercial Applications

Authors:
Yuan Chou; Lawrence Spracklen; Santosh G. Abraham
Affiliations:
Sun Microsystems; Sun Microsystems; Sun Microsystems
Venue:
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2005

References (30):

Memory access buffering in multiprocessors
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture

Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems

A methodology for implementing highly concurrent data objects
ACM Transactions on Programming Languages and Systems (TOPLAS)

Cache write policies and performance
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture

Contrasting characteristics and cache performance of technical and multi-user commercial workloads
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems

Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures

Improving data cache performance by pre-executing instructions under a cache miss
ICS '97 Proceedings of the 11th international conference on Supercomputing

Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture

An analysis of database workload performance on simultaneous multithreaded processors
Proceedings of the 25th annual international symposium on Computer architecture

Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture

Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems

Is SC + ILP = RC?
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture

Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture

Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture

Queuing Simulation Model for Multiprocessor Systems
Computer

Write buffer design for cache-coherent shared-memory multiprocessors
ICCD '95 Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors

Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture

Impact of Reducing Miss Write Latencies in Multiprocessors with Two Level Cache
EUROMICRO '98 Proceedings of the 24th Conference on EUROMICRO - Volume 1

Scaling and Characterizing Database Workloads: Bridging the Gap between Research and Practice
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture

Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture

Effective stream-based and execution-based data prefetching
Proceedings of the 18th annual international conference on Supercomputing

Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture

Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture

Chip Multithreading: Opportunities and Challenges
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture

RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence
Proceedings of the 32nd annual international symposium on Computer Architecture

Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Proceedings of the 32nd annual international symposium on Computer Architecture

Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture

High-Performance Throughput Computing
IEEE Micro

Request Reordering to Enhance the Performance of Strict Consistency Models
IEEE Computer Architecture Letters

Issues in the design of store buffers in dynamically scheduled processors
ISPASS '00 Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software

Cited by (9):

Coarse-Grain Coherence Tracking: RegionScout and Region Coherence Arrays
IEEE Micro

An analysis of the effects of miss clustering on the cost of a cache miss
Proceedings of the 4th international conference on Computing frontiers

Making the fast case common and the uncommon case simple in unbounded transactional memory
Proceedings of the 34th annual international symposium on Computer architecture

Hardware atomicity for reliable software speculation
Proceedings of the 34th annual international symposium on Computer architecture

Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture

Pipeline spectroscopy
Proceedings of the 2007 workshop on Experimental computer science

Pipeline spectroscopy
ecs'07 Experimental computer science on Experimental computer science

Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture

InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture

Abstract

This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. That impact is largely determined by how much the store misses overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps is then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they cannot fully mitigate the performance impact of off-chip store misses, and they leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the Store Miss Accelerator, an optimization of Hardware Scout, and a new application of Speculative Lock Elision, are demonstrated to virtually eliminate the impact of off-chip store misses.
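
The link the abstract draws between miss overlap and off-chip CPI can be illustrated with a toy calculation. The sketch below is not the paper's model or simulator. It assumes a simplified reading of an epoch-style model in which execution is split into epochs, all off-chip misses within one epoch overlap, and each epoch costs roughly one memory latency. The names `Epoch`, `average_mlp`, and `offchip_cpi`, and all numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Epoch:
    """Hypothetical epoch: off-chip misses that are assumed to overlap fully."""
    load_misses: int
    store_misses: int

    @property
    def misses(self) -> int:
        return self.load_misses + self.store_misses


def average_mlp(epochs: List[Epoch]) -> float:
    """Average number of off-chip misses per epoch (a simple MLP estimate)."""
    if not epochs:
        raise ValueError("need at least one epoch")
    return sum(e.misses for e in epochs) / len(epochs)


def offchip_cpi(epochs: List[Epoch], instructions: int,
                miss_latency_cycles: int) -> float:
    """Assumption: each epoch stalls for about one miss latency,
    no matter how many misses overlap inside it."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return len(epochs) * miss_latency_cycles / instructions


# Made-up scenario: 4 off-chip misses in 1000 instructions, 500-cycle latency.
# Store misses that cannot overlap each occupy their own epoch.
serialized = [Epoch(2, 0), Epoch(0, 1), Epoch(0, 1)]
# The same misses when the store misses overlap with the load misses.
overlapped = [Epoch(2, 2)]

for name, epochs in (("serialized", serialized), ("overlapped", overlapped)):
    print(name, average_mlp(epochs), offchip_cpi(epochs, 1000, 500))
```

Under these assumptions the miss count is identical in both scenarios, yet off-chip CPI drops as more misses share an epoch, which is the effect that store handling optimizations aim for.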