Combined instruction and loop parallelism in array synthesis for FPGAs

Authors:
Steven Derrien;Sanjay Rajopadhye;Susmita Sur Kolay
Affiliations:
IRISA, Rennes, France;IRISA, Rennes, France;Indian Statistical Institute, Calcutta, India
Venue:
Proceedings of the 14th international symposium on Systems synthesis
Year:
2001

Citing 6
Cited 4

Regular partitioning for synthesizing fixed-size systolic arrays
Integration, the VLSI Journal

Scheduling of Partitioned Regular Algorithms on Processor Arrays with Constrained Resources
ASAP '96 Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors

High-Level Synthesis of Nonprogrammable Hardware Accelerators
ASAP '00 Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors

High Performance DES Encryption in Virtex(tm) FPGAs Using Jbits(tm)
FCCM '00 Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines

Automatic synthesis of systolic arrays from uniform recurrent equations
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture

Optimal Partitioning for FPGA Based Regular Array Implementations
PARELEC '00 Proceedings of the International Conference on Parallel Computing in Electrical Engineering

Evaluating heuristics in automatically mapping multi-loop applications to FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays

Hardware Acceleration of HMMER on FPGAs
Journal of Signal Processing Systems

Acceleration of a content-based image-retrieval application on the RDISK cluster
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Efficient realization of data dependencies in algorithm partitioning under resource constraints
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing

Abstract

Compiling perfectly nested loops with uniform dependences to FPGA-based co-processors normally yields arrays of processing elements (PEs) in which each PE executes one instance of the loop body per clock cycle. We develop a transformation framework in which the derived PE can be systematically and automatically pipelined through retiming. We use two well-known transformations, skewing and serialization, by which an arbitrary number of registers may be placed at the PE outputs. These registers are then moved into the PE data-path using standard commercial circuit retimers. Our experiments (based on performance estimates after place-and-route) have been very encouraging. For a number of examples we have seen dramatic performance improvements: speed increases of an order of magnitude with relatively little (always less than 100%) area overhead.
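
As an illustration only, the following minimal sketch shows how skewing can add registers at PE outputs. It is not taken from the paper: it assumes a 2-D iteration space in which the first coordinate is the PE index and the second is the clock cycle, and the names `skew` and `link_delays` are hypothetical. Under these assumptions, the delay carried by a dependence is the time component of its skewed dependence vector.

```python
from typing import List, Tuple

Vec = Tuple[int, int]  # (PE index offset, time offset) of a uniform dependence


def skew(dep: Vec, factor: int) -> Vec:
    """Apply the unimodular skew (i, j) -> (i, j + factor * i) to a vector."""
    i, j = dep
    return (i, j + factor * i)


def link_delays(deps: List[Vec], factor: int) -> List[int]:
    """Clock cycles between producer and consumer for each dependence.

    Assumption: after skewing, the second coordinate is the clock cycle,
    so the delay is simply the time component of the skewed vector.
    """
    return [skew(dep, factor)[1] for dep in deps]


if __name__ == "__main__":
    # (1, 0): value passed to the neighbouring PE; (0, 1): value kept in the same PE.
    deps = [(1, 0), (0, 1)]
    for factor in (1, 2, 4):
        print(factor, link_delays(deps, factor))
```

In this toy model, a larger skew factor lengthens only the inter-PE dependence, while the dependence that stays inside a PE keeps a delay of one cycle under skewing alone.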