Streamlining long latency instructions for seamlessly combined out-of-order and in-order execution

  • Authors:
  • Hui Wang; Rama Sangireddy

  • Affiliations:
  • High Performance Dependable Computing Laboratory, Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083, USA (both authors)

  • Venue:
  • Microprocessors & Microsystems
  • Year:
  • 2008

Abstract

In current wide-issue processors, the size of the instruction scheduling window, also called the issue queue (IQ), is limited mainly by the hardware complexity of its logic. This in turn limits the number of instructions that can be scanned every cycle to extract instruction-level parallelism (ILP). To make matters worse, instructions that depend on long-latency load operations remain in the IQ until their source operands are ready. Such delayed instructions block new instructions from entering the IQ, even when those new instructions are potentially ready for execution. The growing disparity between processor and memory speeds further increases the time it takes to dislodge instructions from the IQ. To alleviate the problem, in this paper we propose a novel technique that streamlines instructions into separate buffers according to their chain of dependencies. Each instruction is streamlined behind a parent instruction while it waits for a source operand to be supplied by a long-latency memory operation. These instructions are segregated from the IQ, which relieves the pressure on the IQ and lets potentially executable instructions flow through the pipeline. Our analysis of SPEC2000 programs reveals that instructions dependent on load cache misses, or on the dependents of those loads, typically have their first source operand ready within 5-15% of their total wait time in the IQ. Based on this observation, instructions that depend on long-latency memory operations are streamlined into in-order buffers once their first operand is ready. In the proposed architecture, instructions from both the conventional IQ and the heads of the streamline buffers can be selected for execution, while the complexity of the wakeup logic remains the same as in the conventional design. Our results show that a 32-entry IQ supplemented by 32 in-order buffers achieves a performance speedup of 15.7% on FP benchmarks and 2% on integer benchmarks, which is comparable to that of a conventional 64-entry IQ. A 64-entry IQ can gain performance over a 32-entry IQ only at the cost of a large overhead in the circuit delay of the wakeup logic, whereas streamline buffers gain performance over a 32-entry IQ without any such overhead.
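
The steering and selection steps described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' hardware design: the names `Instr`, `steer`, and `select` are hypothetical, a miss-dependent instruction is steered once exactly one of its sources is still pending, and it is placed behind the buffer tail that produces that pending operand (or into an empty buffer if no such tail exists).

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    srcs: tuple            # source register names
    dst: str               # destination register name
    miss_dependent: bool   # waits, directly or transitively, on a load miss


def pending(instr, ready):
    """Source registers of instr that are not yet available."""
    return [s for s in instr.srcs if s not in ready]


def steer(iq, buffers, ready):
    """Move miss-dependent instructions with exactly one pending source
    out of the IQ and into an in-order buffer (assumed policy)."""
    for instr in list(iq):
        waiting = pending(instr, ready)
        if not instr.miss_dependent or len(waiting) != 1:
            continue
        # Prefer the buffer whose tail produces the pending operand (the parent);
        # otherwise use an empty buffer; if neither exists, stay in the IQ.
        target = next((b for b in buffers if b and b[-1].dst == waiting[0]), None)
        if target is None:
            target = next((b for b in buffers if not b), None)
        if target is not None:
            iq.remove(instr)
            target.append(instr)


def select(iq, buffers, ready):
    """Issue candidates: ready IQ entries plus ready heads of the buffers."""
    from_iq = [i for i in iq if not pending(i, ready)]
    from_heads = [b[0] for b in buffers if b and not pending(b[0], ready)]
    return from_iq + from_heads


# Toy state: an outstanding load that missed in the cache will write r1.
ready = {"r0", "r5"}
iq = [
    Instr("add", ("r1", "r0"), "r2", True),   # waits on the load result r1
    Instr("mul", ("r2", "r5"), "r3", True),   # waits on add
    Instr("sub", ("r0", "r5"), "r4", False),  # independent of the miss
]
buffers = [deque(), deque()]

steer(iq, buffers, ready)
before = select(iq, buffers, ready)   # candidates while the miss is outstanding
ready.add("r1")                       # the missing load returns
after = select(iq, buffers, ready)    # candidates once r1 is available
```

In this toy run, `add` and `mul` leave the IQ and line up in one buffer, so only `sub` occupies an IQ entry. Only buffer heads are examined at selection time, which is the property the abstract relies on to keep wakeup logic complexity unchanged.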