Improving high level synthesis optimization opportunity through polyhedral transformations

Authors:
Wei Zuo;Yun Liang;Peng Li;Kyle Rupnow;Deming Chen;Jason Cong
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA;Peking University, Beijing, China;Peking University, Beijing, China;Advanced Digital Science Center, Singapore, Singapore;University of Illinois at Urbana-Champaign, Urbana, IL, USA;University of California, Los Angeles, Los Angeles, CA, USA
Venue:
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Year:
2013

Citing 20
Cited 2

An affine partitioning algorithm to maximize parallelism and minimize communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
Compiler-generated communication for pipelined FPGA applications

Proceedings of the 40th annual Design Automation Conference
Behavior and communication co-optimization for systems with sequential communication media

Proceedings of the 43rd annual Design Automation Conference
DRDU: A data reuse analysis technique for efficient scratch-pad memory management

ACM Transactions on Design Automation of Electronic Systems (TODAES)
A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops

FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
High-Level Synthesis: from Algorithm to Digital Circuit

High-Level Synthesis: from Algorithm to Digital Circuit
Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Loop transformations: convexity, pruning and optimization

Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
LegUp: high-level synthesis for FPGA-based processor/accelerator systems

Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
Customizable Domain-Specific Computing

IEEE Design & Test
Multilevel Granularity Parallelism Synthesis on FPGAs

FCCM '11 Proceedings of the 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines
Accelerating Fluid Registration Algorithm on Multi-FPGA Platforms

FPL '11 Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications
Optimizing SDRAM bandwidth for custom FPGA loop accelerators

Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
High-level synthesis: productivity, performance, and software constraints

Journal of Electrical and Computer Engineering - Special issue on ESL Design Methodology
Optimizing memory hierarchy allocation with loop transformations for high-level synthesis

Proceedings of the 49th Annual Design Automation Conference
High-Level Synthesis for FPGAs: From Prototyping to Deployment

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Memory partitioning and scheduling co-optimization in behavioral synthesis

Proceedings of the International Conference on Computer-Aided Design
Polyhedral-based data reuse optimization for configurable computing

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays

Polyhedral-based data reuse optimization for configurable computing

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Improving polyhedral code generation for high-level synthesis

Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis

Abstract

High-level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerators. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise and shorter design time. State-of-the-art HLS tools now include a wide variety of powerful optimizations for implementing efficient hardware. These optimizations can implement some of the most important features of manual designs, including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. The optimizations fall broadly into two classes: those that optimize the hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). In practice, however, both classes are difficult to apply. Real-world applications contain data-dependent blocks of code that communicate through complex data access patterns. Existing HLS tools cannot apply these optimizations unless the code is inherently compatible, which severely limits the optimization opportunity. In this paper, we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations that enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates an average speedup of 6.04X over the high-level synthesis solution without our transformations to enable intra- and inter-block optimizations.
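To make the inter-block idea in the abstract concrete, the following is a minimal illustrative sketch and not the paper's framework or algorithm. It assumes a hypothetical producer block that writes a 2-D array in row-major order and a hypothetical consumer block that reads it in column-major order, and it measures how many elements must be buffered between the two blocks before and after a loop interchange on the consumer. The names `producer_write_order`, `consumer_read_order`, `max_live_elements`, and the size `N` are all assumptions introduced for illustration, and the interchange is assumed legal because this toy consumer carries no loop-carried dependences.

```python
from itertools import product

N = 4  # hypothetical problem size, chosen only for illustration


def producer_write_order(n):
    """Order in which a row-major producer nest writes A[i][j]."""
    return [(i, j) for i, j in product(range(n), range(n))]


def consumer_read_order(n, interchanged=False):
    """Order in which the consumer nest reads A[i][j].

    Original nest:     for j: for i: read A[i][j]   (column-major)
    Interchanged nest: for i: for j: read A[i][j]   (row-major)
    """
    if interchanged:
        return [(i, j) for i, j in product(range(n), range(n))]
    return [(i, j) for j, i in product(range(n), range(n))]


def max_live_elements(writes, reads):
    """Peak number of elements written by the producer but not yet consumed.

    The consumer reads greedily: as soon as the next element in its own
    order is available, it is consumed and leaves the buffer.
    """
    buffered = set()
    next_read = 0
    peak = 0
    for element in writes:
        buffered.add(element)
        peak = max(peak, len(buffered))
        # Drain every element the consumer can take in its own order.
        while next_read < len(reads) and reads[next_read] in buffered:
            buffered.remove(reads[next_read])
            next_read += 1
    return peak


writes = producer_write_order(N)
mismatched = max_live_elements(writes, consumer_read_order(N))
matched = max_live_elements(writes, consumer_read_order(N, interchanged=True))
print("peak buffer, original consumer order:    ", mismatched)
print("peak buffer, interchanged consumer order:", matched)
```

In this toy setting the interchanged consumer reads elements in exactly the order the producer writes them, so a single-element buffer suffices and the two blocks could in principle be connected by a fine-grained stream. With the original column-major order, most of the array must be buffered before the consumer can make progress.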