In current wide-issue processors, the size of the instruction scheduling window (also called the Issue Queue, or IQ) is limited mainly by the hardware complexity of its scheduling logic, which in turn limits the number of instructions scanned every cycle to extract instruction-level parallelism (ILP). Exacerbating the problem, instructions that depend on long-latency load operations remain in the IQ until their source operands become ready. Such delayed instructions block new instructions from entering the IQ even when those new instructions are potentially ready for execution, and the growing disparity between processor and memory speeds further aggravates the delay in dislodging instructions from the IQ. To alleviate this problem, we propose a novel technique that streamlines instructions into separate buffers according to their chains of dependences. Each instruction is streamlined behind a parent instruction while it waits for the source operand to be supplied by a long-latency memory operation. These instructions are segregated from the IQ, relieving pressure on it and allowing potentially executable instructions to flow through the pipeline. Our analysis of SPEC2000 programs reveals that instructions dependent on load cache misses, or on their dependents, typically have their first source operand ready within 5-15% of their total wait time in the IQ. Based on this observation, instructions dependent on long-latency memory operations are streamlined into in-order buffers as soon as their first operand is ready. In the proposed architecture, instructions from both the conventional IQ and the heads of the streamline buffers can be selected for execution, while the wakeup logic complexity remains the same as in the conventional design. Our results show that a 32-entry IQ supplemented by 32 in-order buffers achieves speedups of 15.7% and 2% for the FP and integer benchmarks, respectively, which is comparable to a conventional 64-entry IQ.
A 64-entry IQ design can gain performance over a 32-entry IQ, albeit with a large overhead in the circuit delay and complexity of the wakeup logic, whereas streamline buffers gain performance over the 32-entry IQ without any such overhead.
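The dispatch policy described above can be sketched as a toy software model. Everything below is an illustrative assumption, not the paper's implementation: the class name `StreamlineScheduler`, the use of the missing load's destination register as the chain identifier, and the all-other-operands-ready streamlining condition are all hypothetical simplifications. The sketch shows the key idea: an instruction that waits only on one pending miss chain is appended to that chain's in-order FIFO instead of occupying an IQ slot, and the chain drains in program order when the miss returns.

```python
from collections import deque


class StreamlineScheduler:
    """Toy model of streamline buffers (all names/policies are assumptions).

    Instructions that wait only on a pending long-latency load, or on
    another instruction already streamlined behind that load, are moved
    into a per-chain in-order buffer; everything else uses the small IQ.
    """

    def __init__(self, iq_size=32):
        self.iq_size = iq_size
        self.iq = []          # out-of-order window entries
        self.chains = {}      # miss register -> FIFO of dependent insns
        self.chain_of = {}    # register -> chain id (for transitive deps)
        self.ready = set()    # registers whose values are available

    def load_miss(self, reg):
        """A load producing `reg` misses: open a streamline chain for it."""
        self.chain_of[reg] = reg
        self.chains[reg] = deque()

    def dispatch(self, insn, dest, srcs):
        """Place a decoded instruction in the IQ or a streamline buffer."""
        pending = [s for s in srcs if s not in self.ready]
        chains = {self.chain_of[s] for s in pending if s in self.chain_of}
        if pending and len(chains) == 1 and all(s in self.chain_of
                                                for s in pending):
            # Waits only on one miss chain and its other operands are
            # ready: streamline it in order behind its parent.
            cid = chains.pop()
            self.chains[cid].append((insn, dest))
            self.chain_of[dest] = cid  # its dependents join the same chain
            return "streamlined"
        if len(self.iq) < self.iq_size:
            self.iq.append((insn, dest, srcs))
            return "iq"
        return "stall"  # IQ full: the front end stalls

    def miss_returns(self, reg):
        """The miss for `reg` completes; drain its chain in order."""
        self.ready.add(reg)
        executed = []
        for insn, dest in self.chains.pop(reg, deque()):
            executed.append(insn)  # FIFO order preserves dependences
            self.ready.add(dest)
        return executed
```

In a hardware realization only the head of each FIFO would participate in selection, which is why the abstract can claim unchanged wakeup complexity; the model above mimics that by draining each chain strictly in order.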