A case for a complexity-effective, width-partitioned microarchitecture

  • Authors:
  • Olivier Rochecouste;Gilles Pokam;André Seznec

  • Affiliations:
  • IRISA/INRIA, Rennes Cedex, France;University of California, San Diego, La Jolla, CA;IRISA/INRIA, Rennes Cedex, France

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO)
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

The analysis of program executions reveals that most integer and multimedia applications make heavy use of narrow-width operations, i.e., instructions exclusively using narrow-width operands and producing a narrow-width result. Moreover, this usage is relatively well distributed over the application. We observed this program property on the MediaBench and SPEC2000 benchmarks with about 40% of the instructions being narrow-width operations. Current superscalar processors use 64-bit datapaths to execute all the instructions of the applications. In this paper, we suggest the use of a width-partitioned microarchitecture (WPM) to master the hardware complexity of a superscalar processor. For a four-way issue machine, we split the processor in two two-way clusters: the main cluster executing 64-bit operations, load/store, and complex operations and a narrow cluster executing the 16-bit operations. We resort to partitioning to decouple the treatment of the narrow-width operations from that of the other program instructions. This provides the benefit of greatly simplifying the design of the critical processor components in each cluster (e.g., the register file and the bypass network). The dynamic interleaving of the two instruction types allows maintaining the workload balanced among clusters. WPM also helps to reduce the complexity of the interconnection fabric and of the issue logic. In fact, since the 16-bit cluster can only communicate narrow-width data, the datapath-width of the interconnect fabric can be significantly reduced, yielding a corresponding saving of the interconnect power and area. We explore different possible configurations of WPM, discussing the various implementation tradeoffs. We also examine a speculative steering heuristic to distribute the narrow-width operations among clusters. A detailed analysis of the complexity factors shows using WPM instead of a classical 64-bit two-cluster microarchitecture can save power and silicon area with a minimal impact on the overall performance.