This paper presents a technique for the automatic synthesis of high-performance FPGA-based computing machines from C source code. It exploits the data parallelism present in the source code by applying, in hardware, automatic loop transformation techniques that were originally developed for optimizing compilers targeting parallel and vector computers. Performance is considered at an early stage of the design, before low-level synthesis, through a transformation-intensive branch-and-bound approach that searches the design space and explores area-performance tradeoffs. Furthermore, optimizations are applied at the architectural level, which yields greater benefits than gate-level optimizations, aided by a library of hardware blocks that implement arithmetic and functional primitives. The technique is applied to partial and complete unrolling of a Successive Over-Relaxation code, and results are reported on the effectiveness of the area-delay estimation and on the speed-up of the generated circuit, which ranges from 5 to 30 on a Virtex-E 2000-6 relative to an Intel Pentium 3 at 1 GHz.
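To make the unrolling transformation concrete, the following minimal C sketch shows a rolled Successive Over-Relaxation sweep next to a version unrolled by a factor of 4. It is only an illustration under stated assumptions: the abstract does not show the SOR code that was actually used, so the 1-D model problem, the function names (`sor_update`, `sor_sweep`, `sor_sweep_unrolled4`, `max_abs_diff`) and the unroll factor are all hypothetical choices made here.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical 1-D model problem (an assumption, not the paper's code):
 * relax x[i-1] - 2*x[i] + x[i+1] = 0 with x[0] and x[n-1] held fixed.
 * One SOR update blends the old value with the Gauss-Seidel value. */
static double sor_update(const double *x, size_t i, double omega)
{
    double gauss_seidel = 0.5 * (x[i - 1] + x[i + 1]);
    return (1.0 - omega) * x[i] + omega * gauss_seidel;
}

/* Rolled form: one update per loop iteration, performed in place. */
void sor_sweep(double *x, size_t n, double omega)
{
    assert(n >= 2);
    for (size_t i = 1; i + 1 < n; ++i)
        x[i] = sor_update(x, i, omega);
}

/* Partially unrolled form (factor 4, chosen only for illustration).
 * The four statements run in the same order as in the rolled loop,
 * so the loop-carried dependence on x[i-1] is preserved. */
void sor_sweep_unrolled4(double *x, size_t n, double omega)
{
    assert(n >= 2);
    size_t i = 1;

    /* Main body: requires i + 3 <= n - 2, i.e. i + 4 < n. */
    for (; i + 4 < n; i += 4) {
        x[i]     = sor_update(x, i,     omega);
        x[i + 1] = sor_update(x, i + 1, omega);
        x[i + 2] = sor_update(x, i + 2, omega);
        x[i + 3] = sor_update(x, i + 3, omega);
    }

    /* Remainder loop for interior points left over by the unrolling. */
    for (; i + 1 < n; ++i)
        x[i] = sor_update(x, i, omega);
}

/* Largest absolute element-wise difference between two vectors. */
double max_abs_diff(const double *a, const double *b, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = fabs(a[i] - b[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running both sweeps on copies of the same vector and comparing them with `max_abs_diff` should show agreement up to rounding, since the unrolled body performs the same updates in the same order. In a hardware mapping, each statement in the unrolled body would correspond to its own replicated update unit, which is one way to picture the area-performance tradeoff that the abstract describes. Because each update reads the freshly written left neighbour, the replicated units form a dependence chain rather than four fully independent operations.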