Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs

Authors:
Xi E. Chen;Tor M. Aamodt
Affiliations:
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, CANADA;Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, CANADA
Venue:
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Year:
2008


Abstract

As the number of transistors integrated on a chip continues to increase, accurately modeling performance in the early stages of processor design is a growing challenge. Analytical models have been employed to rapidly search for higher-performance designs and can provide insights that detailed simulators may not. This paper proposes techniques to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance in the presence of long-latency memory systems, for hybrid analytical models that apply instruction trace analysis. Pending cache hits are secondary references to a cache block for which a request has already been initiated but has not yet completed. We find that two factors each have a non-negligible influence on the accuracy of hybrid analytical models: pending hits resulting from spatial locality, and the fine-grained selection of the instruction profile window blocks used for analysis. We propose techniques to account for both effects. We then introduce techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches, and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulation. When modeling pending hits for a processor with unlimited outstanding misses, we improve the accuracy of our baseline by a factor of 3.9, decreasing average error from 39.7% to 10.3%. When modeling a processor with data prefetching, a limited number of MSHRs, or both, the techniques result in average errors of 13.8%, 9.5%, and 17.8%, respectively.
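To make the abstract's definition of a pending cache hit concrete, the following is a minimal sketch that labels each access in a toy memory trace as a miss, a pending hit, or an ordinary hit. It is an illustration only, not the paper's model: the function name `classify_accesses`, the 64-byte block size, the 200-cycle memory latency, the `(cycle, address)` trace format, and the assumption of an infinite cache with unlimited outstanding misses are all hypothetical choices made for this example.

```python
BLOCK_SIZE = 64     # bytes per cache block (assumed for illustration)
MEM_LATENCY = 200   # cycles for a block to arrive from memory (assumed)


def classify_accesses(trace, block_size=BLOCK_SIZE, mem_latency=MEM_LATENCY):
    """Label each (cycle, address) access as 'miss', 'pending_hit', or 'hit'.

    Simplifying assumptions (not taken from the paper): the cache never
    evicts a block, and any number of misses may be outstanding at once.
    """
    fill_ready = {}  # block number -> cycle at which its data arrives
    labels = []
    for cycle, addr in trace:
        block = addr // block_size
        if block not in fill_ready:
            # First reference to the block: a request to memory is initiated.
            fill_ready[block] = cycle + mem_latency
            labels.append("miss")
        elif cycle < fill_ready[block]:
            # The request was already initiated but has not completed yet.
            labels.append("pending_hit")
        else:
            # The block has arrived, so this is an ordinary hit.
            labels.append("hit")
    return labels


# Two references to the same block a few cycles apart (spatial locality),
# a reference to a different block, and a late reference to the first block.
example_trace = [(0, 0x1000), (4, 0x1008), (10, 0x2000), (250, 0x1010)]
print(classify_accesses(example_trace))
```

In this toy trace the second access touches the same block as the first only four cycles later, so it is labeled a pending hit: it will see most of the memory latency even though a tag lookup would report the block as already requested. Treating such accesses as ordinary hits is the kind of simplification whose effect on accuracy the abstract describes.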