Dual path instruction processing

  • Authors:
  • Juan L. Aragón;José González;Antonio González;James E. Smith

  • Affiliations:
  • Universidad de Murcia, Murcia (Spain);Universidad de Murcia, Murcia (Spain);Universitat Politècnica de Catalunya, Barcelona (Spain);University of Wisconsin-Madison, Madison, WI

  • Venue:
  • ICS '02 Proceedings of the 16th international conference on Supercomputing
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

The reasons for performance losses due to conditional branch mispredictions are first studied. Branch misprediction penalties are broken into three categories: pipeline-fill penalty, window-fill penalty, and serialization penalty. The first and third of these produce most of the performance loss, but the second is also significant. Previously proposed dual (or multi) path execution methods attempt to reduce all three penalties, but these methods are also quite complex. Most of the complexity is caused by simultaneously executing instructions from multiple paths.A good engineering compromise is to avoid the complexity of multiple path execution by focusing on methods that reduce only the pipeline and window re-fill penalties. Dual Path Instruction Processing (DPIP) is proposed as a simple mechanism that fetches, decodes, and renames, but does not execute, instructions from the alternative path for low confidence predicted branches at the same time as the predicted path is being executed. All the stages of the pipeline front-end are hidden once the misprediction is detected. This method thus targets the pipeline-fill penalty and is shown to achieve a good trade-off between performance and complexity. To reduce the window-fill penalty, we further propose the addition of a pre-scheduling engine that schedules instructions from the alternative path in an estimated execution order. Thus, after a misprediction, a high number of instructions from the alternate path can be immediately issued to execution, achieving an effect similar to very fast re-filling of the window. Performance evaluation of DPIP in a 14-stage superscalar processor (like IBM Power 4) shows an average IPC improvement of up to 10% for the bzip2 benchmark, and an average of 8% for ten benchmarks from the SPECint95 and SPECint2000 suites.