Compiling perfect, uniform-dependence loops to FPGA-based co-processors normally yields arrays of processing elements (PEs) in which each PE executes one instance of the loop body per clock cycle. We develop a transformation framework in which the derived PE can be systematically and automatically pipelined through retiming. We use two well-known transformations, skewing and serialization, by which an arbitrary number of registers may be placed at the PE outputs. These registers are then moved into the PE datapath using standard commercial circuit retimers. Our experiments, based on performance estimates after place-and-route, have been very encouraging: for a number of examples we have seen speed increases of an order of magnitude with relatively little (always less than 100%) area overhead.
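As a rough illustration of how skewing can place extra registers at PE outputs, the sketch below counts the registers on each dependence edge as the dot product of a linear schedule vector with the dependence vector. Everything here is an assumption made for illustration and is not taken from the paper: the two-dimensional loop, the dependence vectors, the base schedule, the choice of the first axis as the processor axis, and the helper names `register_counts` and `skew_schedule`. Serialization and the retiming step itself are not modeled.

```python
def register_counts(dependences, schedule):
    """Registers on each dependence edge, taken as schedule . dependence.

    Assumes a linear schedule: iteration z runs at time schedule . z,
    so a value carried along dependence d waits schedule . d cycles.
    """
    return {d: sum(s * c for s, c in zip(schedule, d)) for d in dependences}


def skew_schedule(schedule, extra):
    """Skew time along the first axis (assumed to be the processor axis).

    Each hop between neighbouring PEs gains `extra` cycles of delay,
    which appear as registers at the PE outputs.
    """
    return (schedule[0] + extra,) + tuple(schedule[1:])


# Hypothetical uniform dependences of a 2-D loop nest:
# (1, 0) crosses to the neighbouring PE, (0, 1) stays inside one PE.
deps = [(1, 0), (0, 1)]
base = (1, 1)

before = register_counts(deps, base)
after = register_counts(deps, skew_schedule(base, 3))

print(before)
print(after)
```

In this toy model only the inter-PE dependence gains registers, while the dependence local to a PE keeps its original delay. Those extra output registers are the ones a circuit retimer could then move into the datapath.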