Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Increasing memory bandwidth with wide buses: compiler, hardware and performance trade-offs
ICS '97 Proceedings of the 11th international conference on Supercomputing
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
On the value locality of store instructions
Proceedings of the 27th annual international symposium on Computer architecture
Proceedings of the 27th annual international symposium on Computer architecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Reducing Memory Traffic Via Redundant Store Instructions
HPCN Europe '99 Proceedings of the 7th International Conference on High-Performance Computing and Networking
Predictive sequential associative cache
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Instruction Pre-Processing in Trace Processors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Characterization of Silent Stores
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Proceedings of the 30th annual international symposium on Computer architecture
DISE: a programmable macro engine for customizing applications
Proceedings of the 30th annual international symposium on Computer architecture
Proceedings of the 31st annual international symposium on Computer architecture
Proceedings of the 32nd annual international symposium on Computer Architecture
RENO: A Rename-Based Instruction Optimizer
Proceedings of the 32nd annual international symposium on Computer Architecture
Hi-index | 0.01 |
The number of pipeline stages separating dynamic instruction scheduling from instruction execution has increased considerably in recent out-of-order microprocessor implementations, forcing the scheduler to allocate functional units and other execution resources several cycles before they are actually used. Unfortunately, several proposed microarchitectural optimizations become less desirable or even impossible in such an environment, since they require instantaneous or near-instantaneous changes in execution behavior and resource usage in response to dynamic events that occur during instruction execution. Since they are detected several cycles after scheduling decisions have already been made, such dynamic responses are infeasible. To overcome this limitation, we propose to implement optimizations by performing what we call speculative decode. Speculative decode alters the mapping between user-visible instructions and the implemented core instructions based on observed runtime characteristics and generates speculative instruction sequences. In these sequences, optimizations are pre-scheduled in a manner compatible with realistic pipelines with multicycle scheduling latency. We present case studies on memory reference combining and silent store squashing, and demonstrate that speculative decode performs comparably or even better than impractical in-core implementations that require zero-cycle scheduling latency.