Extracting Speedup From C-Code With Poor Instruction-Level Parallelism

  • Authors:
  • Dara Kusic;Raymond Hoare;Alex K. Jones;Joshua Fazekas;John Foster

  • Affiliations:
  • University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania

  • Venue:
  • IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 14 - Volume 15
  • Year:
  • 2005

Abstract

Scientific computing and multimedia applications frequently call loop-intensive functions that dominate execution time. Applying homogeneous parallel processors (e.g., single-instruction, multiple-data (SIMD) and very-long instruction word (VLIW)) is a common approach to minimizing execution time. However, many benchmark applications offer disappointing degrees of instruction-level parallelism (ILP), which cause these architectures to fall short of expected performance gains. This paper presents findings on the execution-time speedup achieved by heterogeneous, massively parallel processors: standard reduced instruction-set computing (RISC) CPUs tightly coupled with arrays of super-complex instruction-set computing (SuperCISC) datapaths on the same chip. SuperCISC datapaths are created by mapping frequently called functions into reconfigurable hardware. Encouraging performance results from the RISC/SuperCISC architecture point to the efficiency of reconfigurable devices in supporting large numbers of parallel computational accelerators. Calls to SuperCISC functions can greatly reduce execution time when applied to CPUs that support extensible instruction sets. In this paper we show how SuperCISC functions can accelerate an application up to 25x over a 4-way VLIW. SuperCISC functions show superlinear speedup, a performance gain significantly greater than the software's ILP. SuperCISC functions also benefit from cycle compression, a reduction of the idle cycle time incurred when an operation executes within a traditional CPU. Implementing software controls, or if-then-else statements, as hardware multiplexers within a SuperCISC function further advances performance.
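
As a rough illustration of the last point, the C sketch below contrasts an if-then-else chain with an equivalent form built from two-input selects, which is the shape a multiplexer takes in hardware. The function names (`clamp_branch`, `mux2`, `clamp_mux`) and the clamping example are hypothetical and are not taken from the paper; this only shows the general idea of turning control flow into data selection, not the authors' actual mapping flow.

```c
#include <stdint.h>

/* Branching form: control flow decides which single value is produced. */
int32_t clamp_branch(int32_t x, int32_t lo, int32_t hi) {
    if (x < lo) {
        return lo;
    } else if (x > hi) {
        return hi;
    } else {
        return x;
    }
}

/* Hypothetical helper modeling a 2:1 multiplexer: both inputs are
 * already available and the select signal picks one of them. */
static inline int32_t mux2(int sel, int32_t a, int32_t b) {
    return sel ? a : b;
}

/* Select form: the two comparisons are independent of each other, so a
 * hardware datapath could evaluate them side by side and feed the
 * results into two chained multiplexers. */
int32_t clamp_mux(int32_t x, int32_t lo, int32_t hi) {
    int32_t upper = mux2(x > hi, hi, x);   /* inner mux: upper bound */
    return mux2(x < lo, lo, upper);        /* outer mux: lower bound */
}
```

Both functions return the same result for every input; the second simply exposes the comparisons as independent operations rather than a sequence of branches.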