Instruction Reuse in SPEC, media and packet processing benchmarks: A comparative study of power, performance and related microarchitectural optimizations

  • Authors:
  • G. Surendra; Subhasis Banerjee; S. K. Nandy

  • Affiliations:
  • CAD Lab, SERC, Indian Institute of Science, Bangalore 560012, India. Corresponding author: surendra@cadl.iisc.ernet.in. E-mail: surendra@cadl.iisc.ernet.in, subhasis@cadl.iisc.ernet.in, nandy@serc.iisc.ernet.in

  • Venue:
  • Journal of Embedded Computing - Embedded Processors and Systems: Architectural Issues and Solutions for Emerging Applications
  • Year:
  • 2006

Abstract

The effectiveness of Instruction Reuse (IR), a technique to eliminate redundant computations at run time, is limited: the performance gain seldom exceeds 3% and depends on the criticality of the instructions being reused. In this paper, we focus on the power aspect of IR and propose a "result bus optimization" that exploits communication reuse to reduce the power dissipated over a high-capacitance result bus. The effectiveness of this optimization depends on the number of result-producing instructions that are reused; it improves overall power and Energy-Delay Product (EDP) by 3% over a base IR policy for a 1024-entry "Reuse Buffer" (RB). As a domain-specific study, we examine the impact of multithreading on IR in the context of packet header processing applications. Specifically, sharing the RB among threads can lead to either constructive or destructive interference, thereby increasing or decreasing the amount of IR that can be uncovered. Further, packet header processing applications are unique in that repetition of data values within "flows" is quite prevalent, which can be exploited to improve IR. We find that an architecture that uses this "flow" information to govern accesses to the RB improves IR by as much as 4.6% for header processing kernels.
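
As a rough illustration of the reuse mechanism the abstract refers to, the following minimal sketch models a Reuse Buffer as a direct-mapped table indexed by instruction address, where an instruction is "reused" if its operand values match those recorded on a previous execution. The organization (direct-mapped, value-based matching) and all names (`ReuseBuffer`, `RBEntry`, `execute`) are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RBEntry:
    """One hypothetical Reuse Buffer entry: instruction address, operands, result."""
    pc: int
    operands: Tuple[int, ...]
    result: int


class ReuseBuffer:
    """Illustrative direct-mapped Reuse Buffer (an assumed organization)."""

    def __init__(self, num_entries: int = 1024) -> None:
        self.num_entries = num_entries
        self.entries: List[Optional[RBEntry]] = [None] * num_entries
        self.lookups = 0
        self.hits = 0

    def lookup(self, pc: int, operands: Tuple[int, ...]) -> Optional[int]:
        """Return the stored result if the same instruction saw the same operands."""
        self.lookups += 1
        entry = self.entries[pc % self.num_entries]
        if entry is not None and entry.pc == pc and entry.operands == operands:
            self.hits += 1
            return entry.result
        return None

    def insert(self, pc: int, operands: Tuple[int, ...], result: int) -> None:
        """Record a result, overwriting whatever occupied the slot."""
        self.entries[pc % self.num_entries] = RBEntry(pc, operands, result)


def execute(rb: ReuseBuffer, pc: int, operands: Tuple[int, ...],
            compute: Callable[..., int]) -> Tuple[int, bool]:
    """Execute one instruction; return (result, was_reused)."""
    reused = rb.lookup(pc, operands)
    if reused is not None:
        return reused, True          # redundant computation skipped
    result = compute(*operands)      # normal execution path
    rb.insert(pc, operands, result)
    return result, False
```

In this sketch, executing the same instruction twice with identical operands yields a hit on the second attempt, while two instructions that map to the same slot evict each other. That eviction is a simple analogue of the destructive interference that can arise when several threads share one RB.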