Rapid VLIW processor customization for signal processing applications using combinational hardware functions

Authors:
Raymond R. Hoare;Alex K. Jones;Dara Kusic;Joshua Fazekas;John Foster;Shenchih Tung;Michael McCloud
Affiliations:
Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA (all authors)
Venue:
EURASIP Journal on Applied Signal Processing
Year:
2006

Citing 35
Cited 11

Partitioned register files for VLIWs: a preliminary analysis of tradeoffs

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
High-level transformations for minimizing syntactic variances

DAC '93 Proceedings of the 30th international Design Automation Conference
MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Managing pipeline-reconfigurable FPGAs

FPGA '98 Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays
High-performance carry chains for FPGAs

FPGA '98 Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays
A video signal processor for MIMD multiprocessing

DAC '98 Proceedings of the 35th annual Design Automation Conference
PipeRench: a coprocessor for streaming multimedia acceleration

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
ECL: a specification environment for system-level design

Proceedings of the 36th annual ACM/IEEE Design Automation Conference
Automatic test pattern generation for functional RTL circuits using assignment decision diagrams

Proceedings of the 37th Annual Design Automation Conference
The Garp Architecture and C Compiler

Computer
PipeRench: A Reconfigurable Architecture and Compiler

Computer
The Olympus Synthesis System

IEEE Design & Test
Imagine: Media Processing with Streams

IEEE Micro
RaPiD - Reconfigurable Pipelined Datapath

FPL '96 Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers
Profiling tools for hardware/software partitioning of embedded applications

Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
PACT HDL: a compiler targeting ASICs and FPGAs with power and performance optimizations

Power aware computing
Architecture Design of Reconfigurable Pipelined Datapaths

ARVLSI '99 Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI
The Chimaera reconfigurable functional unit

FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
Garp: a MIPS processor with a reconfigurable coprocessor

FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
Incremental reconfiguration for pipelined applications

FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
Mapping applications to the RaPiD configurable architecture

FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
RVC - A Reconfigurable Coprocessor for Vector Processing Applications

FCCM '98 Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines
Specifying and Compiling Applications for RaPiD

FCCM '98 Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines
A MATLAB Compiler for Distributed, Heterogeneous, Reconfigurable Computing Systems

FCCM '00 Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines
SPARK: A High-Level Synthesis Framework For Applying Parallelizing Compiler Transformations

VLSID '03 Proceedings of the 16th International Conference on VLSI Design
Media Processing Applications on the Imagine Stream Processor

ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
The Imagine Stream Processor

ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
Efficient Application Representation for HASTE: Hybrid Architectures with a Single, Transformable Executable

FCCM '03 Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Kernel Formation in Garpcc

FCCM '03 Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Behavioral Synthesis of Data-Dominated Circuits for Minimal Energy Implementation

VLSID '05 Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design
Overview of a compiler for synthesizing MATLAB programs onto FPGAs

IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2002 international symposium on low-power electronics and design (ISLPED)
On the sphere-decoding algorithm I. Expected complexity

IEEE Transactions on Signal Processing - Part I
Scalar coprocessors for accelerating the G723.1 and G729A speech coders

IEEE Transactions on Consumer Electronics
Using global code motions to improve the quality of results for high-level synthesis

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Boundary macroblock padding in MPEG-4 video decoding using a graphics coprocessor

IEEE Transactions on Circuits and Systems for Video Technology

An automated, reconfigurable, low-power RFID tag

Proceedings of the 43rd annual Design Automation Conference
Reducing power while increasing performance with SuperCISC

ACM Transactions on Embedded Computing Systems (TECS)
An automated, FPGA-based reconfigurable, low-power RFID tag

Microprocessors & Microsystems
Radio frequency identification prototyping

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Optimizing near-ML MIMO detector for SDR baseband on parallel programmable architectures

Proceedings of the conference on Design, automation and test in Europe
A design automation and power estimation flow for RFID systems

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Interconnect customization for a hardware fabric

ACM Transactions on Design Automation of Electronic Systems (TODAES)
VLSI architecture design approaches for real-time video processing

WSEAS Transactions on Circuits and Systems
A low-power CMOS thyristor based delay element with programmability extensions

Proceedings of the 19th ACM Great Lakes symposium on VLSI
A survey of programmable and dedicated approaches in VLSI architecture design for real-time video processing

ICC'08 Proceedings of the 12th WSEAS international conference on Circuits
Design space exploration for low-power reconfigurable fabrics

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Abstract

This paper presents an architecture that combines VLIW (very long instruction word) processing with application-specific customized instructions and highly parallel combinational hardware functions to accelerate signal processing applications. To support this architecture, a compilation and design automation flow is described for algorithms written in C. The key contributions of this paper are as follows: (1) a 4-way VLIW processor implemented in an FPGA, (2) large speedups through hardware functions, (3) a hardware/software interface with zero overhead, (4) a design methodology for implementing signal processing applications on this architecture, and (5) tractable design automation techniques for extracting and synthesizing hardware functions. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA, which was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. We tested the methodology and architecture by accelerating software from the MediaBench benchmark suite. The combined VLIW processor with hardware functions was compared against software executing on a RISC processor, specifically the soft-core embedded NIOS II processor. For software kernels converted into hardware functions, the hardware runs up to 230 times faster than software, and 63 times faster on average. For an entire application in which only a portion of the software is converted to hardware, performance improves by as much as 30 times over the nonaccelerated application, with a 12-times improvement on average.
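
The gap between the kernel-level and application-level speedups reported in the abstract is what Amdahl's law would predict when only part of a program is moved to hardware. The following Python sketch is illustrative only and is not taken from the paper: the function names are hypothetical, and it assumes a single accelerated fraction with one uniform kernel speedup. The paper's averages are taken across benchmarks and need not combine this way.

```python
def amdahl_speedup(fraction_accelerated: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction_accelerated` of the original runtime
    is made `kernel_speedup` times faster (Amdahl's law)."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - fraction_accelerated          # software left untouched
    accelerated = fraction_accelerated / kernel_speedup  # part moved to hardware
    return 1.0 / (remaining + accelerated)


def implied_fraction(overall_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the fraction of original runtime that would have
    to be accelerated to reach `overall_speedup` with a given kernel speedup."""
    if kernel_speedup <= 1.0:
        raise ValueError("kernel_speedup must exceed 1")
    if not 1.0 <= overall_speedup <= kernel_speedup:
        raise ValueError("overall_speedup must lie in [1, kernel_speedup]")
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Figures from the abstract, used purely as illustrative inputs.
    for overall, kernel in [(12.0, 63.0), (30.0, 230.0)]:
        f = implied_fraction(overall, kernel)
        print(f"kernel {kernel:.0f}x, overall {overall:.0f}x -> "
              f"implied accelerated fraction {f:.3f}")
```

Under these simplifying assumptions, a 63-times kernel speedup yields a 12-times application speedup only if roughly 93% of the original runtime sits in the accelerated kernels, which shows how strongly the remaining software limits the overall gain.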