In High-Level Synthesis (HLS), the main advantage over software execution is the ability to extract parallelism in order to create small, fast circuits. Modulo Scheduling (MS) is a technique that parallelizes a loop by overlapping different parts of successive iterations. This ability to extract parallelism makes MS an attractive synthesis technique for loop acceleration. In this work we consider two problems in the use of MS that are central when targeting FPGAs. Current MS techniques sacrifice execution time in order to meet resource and delay constraints. Let the “ideal” execution time be the one that MS could have obtained had resource and delay constraints been ignored. Here we pose the opposite problem, which is better suited to HLS: how to reduce resource usage without sacrificing the ideal execution time. We focus on reducing the number of memory ports used by the MS synthesis, which we believe are a crucial resource for HLS. In addition to reducing the number of memory ports, we address the need for MS techniques that are fast enough to allow interactive synthesis times and repeated applications of MS to explore different ways of synthesizing the circuits. Current MS synthesis solutions that can handle memory constraints are too slow to support interactive synthesis. We formalize the problem of reducing the number of parallel memory references in every row of the kernel as a novel combinatorial problem. The proposed technique inserts dummy operations into the kernel, thereby performing modulo-shift operations that reduce the maximal number of parallel memory references in a row. Experimental results suggest improved execution times for the synthesized circuit. The synthesis takes only a few seconds, even for large loops.
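The abstract does not spell out the algorithm, so the following is only a minimal sketch of the underlying idea, not the authors' combinatorial formulation. It assumes that a memory operation issued at cycle t of the single-iteration schedule occupies kernel row t mod II, that inserting a dummy operation before it delays it by one cycle (shifting it to the next row modulo II), and that each operation has a hypothetical per-op `slack`: the number of cycles it may be delayed without lengthening the schedule. Delays are treated independently, ignoring how they would propagate through data dependences. The names `MemOp`, `row_loads`, and `reduce_peak_ports` are illustrative, and the greedy search is a stand-in heuristic.

```python
from collections import Counter
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemOp:
    name: str
    start: int  # issue cycle in the single-iteration (flat) schedule
    slack: int  # ASSUMPTION: cycles this op may be delayed without lengthening the schedule


def row_loads(ops, ii):
    """Count memory references per kernel row; an op issued at cycle t lands in row t mod II."""
    counts = Counter(op.start % ii for op in ops)
    return [counts[r] for r in range(ii)]


def reduce_peak_ports(ops, ii):
    """Greedy sketch: delay memory ops (as if by inserting dummy ops) to lower the peak row load.

    II is never changed, and delays stay within each op's assumed slack.
    Returns the updated ops and the number of dummy cycles inserted before each op.
    """
    ops = list(ops)
    delays = {op.name: 0 for op in ops}
    while True:
        loads = row_loads(ops, ii)
        peak = max(loads)
        move = None
        for i, op in enumerate(ops):
            if loads[op.start % ii] != peak:
                continue  # only ops sitting in a peak row are worth moving
            for d in range(1, op.slack - delays[op.name] + 1):
                # Accept only a target row that stays strictly below the current peak.
                if loads[(op.start + d) % ii] + 1 < peak:
                    move = (i, d)
                    break
            if move is not None:
                break
        if move is None:
            return ops, delays
        i, d = move
        delays[ops[i].name] += d
        ops[i] = replace(ops[i], start=ops[i].start + d)


# Usage: with II = 2, three references collide in row 0.
ops = [MemOp("a", 0, 0), MemOp("b", 2, 1), MemOp("c", 4, 1), MemOp("d", 1, 0)]
print(row_loads(ops, 2))  # [3, 1]
new_ops, delays = reduce_peak_ports(ops, 2)
print(row_loads(new_ops, 2), delays)  # [2, 2] {'a': 0, 'b': 1, 'c': 0, 'd': 0}
```

Delaying `b` by a single dummy cycle moves it from row 0 to row 1, so the kernel needs two memory ports per row instead of three while II is unchanged. A move is accepted only when the target row's load is at most the peak minus two, so every move strictly lowers the sum of squared row loads and the loop is guaranteed to terminate.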