Global register allocation at link time
SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Streamlining data cache access with fast address calculation
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Zero-cycle loads: microarchitecture support for reducing load latency
Proceedings of the 28th annual international symposium on Microarchitecture
Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Minimum cost interprocedural register allocation
POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Eliminating operand read latency
ACM SIGARCH Computer Architecture News
Exceeding the dataflow limit via value prediction
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Speculative execution via address prediction and data prefetching
ICS '97 Proceedings of the 11th international conference on Supercomputing
Dynamic speculation and synchronization of data dependences
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving the accuracy and performance of memory communication through renaming
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The predictability of data values
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Memory dependence prediction using store sets
Proceedings of the 25th annual international symposium on Computer architecture
Value locality and speculative execution
Value locality and speculative execution
Predictive techniques for aggressive load speculation
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Compiler-directed early load-address generation
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Understanding the differences between value prediction and instruction reuse
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Correlated load-address predictors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory dependence prediction
Design and evaluation of a multiscalar processor
Design and evaluation of a multiscalar processor
The potential of using dynamic information flow analysis in data value prediction
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 14.98 |
We observe that typical programs exhibit highly regular read-after-read (RAR) memory dependence streams. To exploit this regularity, we introduce read-after-read (RAR) memory dependence prediction. This technique predicts whether: 1) A load will access a memory location that a preceding load accesses and 2) exactly which this preceding load is. This prediction is done without actual knowledge of the corresponding memory addresses. We also present two techniques that utilize RAR memory dependence prediction to reduce memory latency. In the first technique, a load may obtain a value by naming a preceding load with which an RAR dependence is predicted. The second technique speculatively converts a series of ${\rm LOAD}_1{\hbox{-}}{\rm USE}_1,\ldots,{\rm LOAD_N}{\hbox{-}}{\rm USE_N}$ chains into a single ${\rm LOAD}_1{\hbox{-}}{\rm USE}_1\ldots{\rm USE_N}$ producer/consumer graph. This is done whenever RAR dependences are predicted among the ${\rm LOAD_i}$ instructions. Our techniques can be implemented as small extensions to the previously proposed read-after-write (RAW) dependence prediction-based speculative memory cloaking and speculative memory bypassing. On average, our RAR-based techniques provide correct values for an additional 20 percent (integer codes) and 30 percent (floating-point codes) of all loads. Moreover, a combined RAW- and RAR-based cloaking/bypassing mechanism improves performance by 6.44 percent (integer) and 4.66 percent (floating-point) over a highly aggressive dynamically scheduled superscalar processor that uses naive memory dependence speculation. By comparison, the original RAW-based cloaking/bypassing mechanism yields improvements of 4.28 percent (integer) and 3.20 percent (floating-point). When no memory dependence speculation is used, our techniques yield speedups of 9.85 percent (integer) and 6.14 percent (floating-point).