Extracting Speedup From C-Code With Poor Instruction-Level Parallelism

Authors:
Dara Kusic;Raymond Hoare;Alex K. Jones;Joshua Fazekas;John Foster
Affiliations:
University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania;University of Pittsburgh, Pennsylvania
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 14 - Volume 15
Year:
2005

Abstract

Scientific computing and multimedia applications frequently call loop-intensive functions that dominate execution time. Applying homogeneous, parallel processors (e.g., single-instruction, multiple-data (SIMD) and very long instruction word (VLIW)) is a common approach to minimizing execution time. However, many benchmark applications offer disappointing degrees of instruction-level parallelism (ILP) that cause these architectures to fall short of expected performance gains. This paper presents findings on execution time speedup achieved by heterogeneous massively parallel processors: standard reduced instruction-set computing (RISC) CPUs tightly coupled with arrays of super-complex instruction-set computing (SuperCISC) datapaths on the same chip. SuperCISC datapaths are created by mapping frequently-called functions into reconfigurable hardware. Encouraging performance results from the RISC/SuperCISC architecture point to the efficiency of reconfigurable devices to support large numbers of parallel computational accelerators. Calls to SuperCISC functions can greatly expedite execution time when applied to CPUs that support extensible instruction sets. In this paper we show how SuperCISC functions can accelerate an application up to 25x over a 4-way VLIW. SuperCISC functions show superlinear speedup, a performance gain significantly greater than the software's ILP. SuperCISC functions also benefit from cycle compression, or a reduction of the idle cycle time for an operation to execute within a traditional CPU. Implementing software controls, or if-then-else statements, as hardware multiplexers within a SuperCISC function further advances performance.
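
The last point of the abstract, replacing if-then-else control with multiplexers, can be illustrated in plain C. The sketch below is not taken from the paper: the clipping kernel and the names `clip_branch`, `mux2`, `clip_mux`, and `check_clip_equivalence` are hypothetical, chosen only to show the general idea. The branching version makes a data-dependent control decision, while the if-converted version evaluates both comparisons unconditionally and picks the result with two chained 2-to-1 selects, which is the shape a hardware datapath with multiplexers would presumably take.

```c
#include <assert.h>
#include <stdint.h>

/* Software control flow: the result depends on which branch is taken. */
int32_t clip_branch(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo)
        return lo;
    else if (x > hi)
        return hi;
    else
        return x;
}

/* Model of a 2-to-1 multiplexer: sel chooses between inputs a and b. */
static int32_t mux2(int sel, int32_t a, int32_t b)
{
    return sel ? a : b;
}

/* If-converted form: both comparisons are computed unconditionally,
 * and the result is selected by two chained multiplexers. */
int32_t clip_mux(int32_t x, int32_t lo, int32_t hi)
{
    int below = x < lo;   /* select signal for the outer mux */
    int above = x > hi;   /* select signal for the inner mux */
    return mux2(below, lo, mux2(above, hi, x));
}

/* Checks that both forms agree over a small input range (illustrative only). */
void check_clip_equivalence(void)
{
    for (int32_t x = -300; x <= 300; ++x)
        assert(clip_branch(x, 0, 255) == clip_mux(x, 0, 255));
}
```

In the second form the two comparisons have no dependence on each other, so hardware could evaluate them side by side; whether a particular compiler or synthesis flow performs this transformation is outside what the abstract states.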