Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques

Authors:
Kevin Skadron;Pritpal S. Ahuja;Margaret Martonosi;Douglas W. Clark
Affiliations:
Univ. of Virginia, Charlottesville;Compaq Computer Corp., Shrewbury, MA;Princeton Univ., Princeton, NJ;Princeton Univ., Princeton, NJ
Venue:
IEEE Transactions on Computers
Year:
1999

Citing 53
Cited 23

Instruction issue logic for high-performance, interruptable pipelined processors

ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems

IEEE Transactions on Computers
The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-Offs

Computer
Available instruction-level parallelism for superscalar and superpipelined machines

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Limits on multiple instruction issue

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Limits of instruction-level parallelism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A model for estimating trace-sample miss ratios

SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Limits of control flow on parallelism

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Improving the accuracy of dynamic branch prediction using branch correlation

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Predicting conditional branch directions from previous runs of a program

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
A comparison of dynamic branch predictors that use two levels of branch history

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Effectiveness of trace sampling for performance debugging tools

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Reducing indirect function call overhead in C++ programs

POPL '94 Proceedings of the 21st ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Fast and accurate instruction fetch and branch prediction

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The effect of speculatively updating branch history on branch prediction accuracy, revisited

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Contrasting characteristics and cache performance of technical and multi-user commercial workloads

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the accuracy of static branch prediction using branch correlation

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A comparative analysis of schemes for correlated branch prediction

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Alternative implementations of hybrid branch predictors

Proceedings of the 28th annual international symposium on Microarchitecture
Correlation and aliasing in dynamic branch predictors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Analysis techniques for predicated code

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Compiler synthesized dynamic branch prediction

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Wrong-path instruction prefetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Designing high bandwidth on-chip caches

Proceedings of the 24th annual international symposium on Computer architecture
Memory-system design considerations for dynamically-scheduled processors

Proceedings of the 24th annual international symposium on Computer architecture
Data prefetching on the HP PA-8000

Proceedings of the 24th annual international symposium on Computer architecture
Target prediction for indirect jumps

Proceedings of the 24th annual international symposium on Computer architecture
The agree predictor: a mechanism for reducing negative branch history interference

Proceedings of the 24th annual international symposium on Computer architecture
Trading conflict and capacity aliasing in conditional branch predictors

Proceedings of the 24th annual international symposium on Computer architecture
A language for describing predictors and its application to automatic synthesis

Proceedings of the 24th annual international symposium on Computer architecture
Run-time adaptive cache hierarchy management via reference analysis

Proceedings of the 24th annual international symposium on Computer architecture
The bi-mode branch predictor

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Path-based next trace prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Multipath execution: opportunities and limits

ICS '98 Proceedings of the 12th international conference on Supercomputing
Kin: a high performance asynchronous processor architecture

ICS '98 Proceedings of the 12th international conference on Supercomputing
An analysis of correlation and predictability: what makes two-level branch predictors work

Proceedings of the 25th annual international symposium on Computer architecture
Accurate indirect branch prediction

Proceedings of the 25th annual international symposium on Computer architecture
Threaded multiple path execution

Proceedings of the 25th annual international symposium on Computer architecture
Selective eager execution on the PolyPath architecture

Proceedings of the 25th annual international symposium on Computer architecture
Recovery requirements of branch prediction storage structures in the presence of mispredicted-path execution

International Journal of Parallel Programming
The YAGS branch prediction scheme

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Load latency tolerance in dynamically scheduled processors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Improving prediction for procedure returns with return-address-stack repair mechanisms

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Design Issues and Tradeoffs for Write Buffers

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
The Alpha 21264 Microprocessor Architecture

ICCD '98 Proceedings of the International Conference on Computer Design
Software methods for improvement of cache performance on supercomputer applications

Software methods for improvement of cache performance on supercomputer applications

Wattch: a framework for architectural-level power analysis and optimizations

Proceedings of the 27th annual international symposium on Computer architecture
Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance

ACM Transactions on Computer Systems (TOCS)
A time-stamping algorithm for efficient performance estimation of superscalar processors

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Choosing representative slices of program execution for microarchitecture simulations: a preliminary application to the data stream

Workload characterization of emerging computer applications
A large, fast instruction window for tolerating cache misses

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A microprocessor survey course for learning advanced computer architecture

SIGCSE '02 Proceedings of the 33rd SIGCSE technical symposium on Computer science education
Parallel simulation of chip-multiprocessor architectures

ACM Transactions on Modeling and Computer Simulation (TOMACS)
The performance advantage of applying compression to the memory system

Proceedings of the 2002 workshop on Memory system performance
A Statistically Rigorous Approach for Improving Simulation Methodology

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Multiple-path execution for chip multiprocessors

Journal of Systems Architecture: the EUROMICRO Journal
Evaluation and choice of various branch predictors for low-power embedded processor

Journal of Computer Science and Technology
Efficient simulation of trace samples on parallel machines

Parallel Computing
Alloyed branch history: combining global and local branch history for robust performance

International Journal of Parallel Programming
Accelerated warmup for sampled microarchitecture simulation

ACM Transactions on Architecture and Code Optimization (TACO)
Optimal sample length for efficient cache simulation

Journal of Systems Architecture: the EUROMICRO Journal
Simulation of Computer Architectures: Simulators, Benchmarks, Methodologies, and Recommendations

IEEE Transactions on Computers
Yet shorter warmup by combining no-state-loss and MRRL for sampled LRU cache simulation

Journal of Systems and Software - Special issue: Quality software
Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Applying Statistical Sampling for Fast and Efficient Simulation of Commercial Workloads

IEEE Transactions on Computers
Future ILP processors

International Journal of High Performance Computing and Networking
Superscalar architecture design for high performance DSP operations

Microprocessors & Microsystems
Finding extreme behaviors in microprocessor workloads

Transactions on High-Performance Embedded Architectures and Compilers IV
Design space exploration of hybrid ultra low power branch predictors

ARCS'12 Proceedings of the 25th international conference on Architecture of Computing Systems

Quantified Score

Hi-index	14.99

Visualization

Abstract

Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Trade-offs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or overstate the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed trade-offs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.