Combining computation and communication optimizations in system synthesis for streaming applications

Authors:
Jason Cong;Muhuan Huang;Peng Zhang
Affiliations:
University of California, Los Angeles, Los Angeles, CA, USA;University of California, Los Angeles, Los Angeles, CA, USA;University of California, Los Angeles, Los Angeles, CA, USA
Venue:
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Year:
2014

Abstract

Data streaming is a widely used technique for exploiting task-level parallelism in application domains such as video processing, signal processing, and wireless communication. In this paper, we propose an efficient system-level synthesis flow that maps streaming applications onto FPGAs while optimizing computation and communication simultaneously. The throughput of a streaming system depends significantly not only on the performance and number of replicas of the computation kernels, but also on the buffer sizes allocated for communication between kernels. Previous work has generally addressed module selection/replication and buffer size optimization separately. Our approach combines these optimizations in system scheduling to minimize the area cost of both logic and memory under a required throughput constraint. We first propose an integer linear programming (ILP) based solution to the combined problem, which yields optimal quality of results. We then propose an iterative algorithm that achieves near-optimal quality of results while scaling significantly better to large and complex designs. The key contribution is a pair of polynomial-time algorithms: one for an exact schedulability checking problem, and one that improves system performance through better module implementation and buffer size optimization. Experimental results show that, compared to the separate scheme of module selection/replication and buffer size optimization, the combined optimization scheme saves 62% area on average under the same performance requirements. Moreover, our heuristic achieves a speed-up of 2 to 3 orders of magnitude in runtime, with less than 10% area overhead compared to the optimal ILP solution.
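
The benefit of optimizing module selection/replication and buffer sizing jointly can be illustrated with a toy model. The sketch below is not the paper's ILP formulation or its iterative heuristic; it substitutes exhaustive enumeration over a two-kernel pipeline. Every number, the `burst` attribute, and the rule that a channel buffer must hold the larger burst of its producer and consumer are assumptions made purely for illustration.

```python
from itertools import product
from math import ceil

# Hypothetical candidate implementations for each kernel in a linear pipeline:
# (logic_area_per_replica, tokens_per_cycle_per_replica, burst_size)
KERNELS = [
    [(10, 1.0, 1), (18, 2.0, 16)],  # kernel 0: small/slow vs. large/fast/bursty
    [(12, 1.0, 1), (20, 2.0, 16)],  # kernel 1
]
TARGET = 2.0               # required throughput in tokens per cycle (assumed)
BUFFER_AREA_PER_TOKEN = 1  # memory area per buffered token (assumed)


def logic_area(impl):
    """Area of one kernel, replicated just enough to reach TARGET."""
    area, rate, _burst = impl
    replicas = ceil(TARGET / rate)
    return area * replicas


def buffer_area(choice):
    """Toy rule: each channel holds the larger burst of its two endpoints."""
    return sum(
        BUFFER_AREA_PER_TOKEN * max(prod[2], cons[2])
        for prod, cons in zip(choice, choice[1:])
    )


def total_area(choice):
    return sum(logic_area(impl) for impl in choice) + buffer_area(choice)


def separate():
    """Pick the cheapest logic per kernel first; buffers are sized afterwards."""
    return tuple(min(impls, key=logic_area) for impls in KERNELS)


def combined():
    """Search all implementation choices, minimizing logic plus buffer area."""
    return min(product(*KERNELS), key=total_area)


if __name__ == "__main__":
    print("separate:", total_area(separate()))
    print("combined:", total_area(combined()))
```

With these made-up numbers, the separate flow selects the fast, bursty implementations and ends at 54 area units, while the joint search selects the replicated small implementations and ends at 45. Exhaustive search grows exponentially with the number of kernels, which is the scalability problem that motivates a polynomial-time heuristic.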