Proceedings of the 1989 ACM/IEEE conference on Supercomputing
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
Automatic parallelization for a class of regular computations
Automatic parallelization for a class of regular computations
Automatic storage management for parallel programs
Parallel Computing - Special issues on languages and compilers for parallel computers
Loop Parallelization
Image and Video Compression Standards: Algorithms and Architectures
Image and Video Compression Standards: Algorithms and Architectures
Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design
IEEE Transactions on Parallel and Distributed Systems
Compiler-decided dynamic memory allocation for scratch-pad based embedded systems
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Convex Optimization
Input data reuse in compiling window operations onto reconfigurable hardware
Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Advanced Computer Architecture and Parallel Processing (Wiley Series on Parallel and Distributed Computing)
Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Maximizing data reuse for minimizing memory space requirements and execution cycles
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies
Proceedings of the 43rd annual Design Automation Conference
A practical dynamic single assignment transformation
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Automatic On-chip Memory Minimization for Data Reuse
FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Hierarchical algorithm partitioning at system level for an improved utilization of memory structures
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
A compiler-based approach for dynamically managing scratch-pad memories in embedded systems
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
PM-COSYN: PE and memory co-synthesis for MPSoCs
Proceedings of the Conference on Design, Automation and Test in Europe
Combined loop transformation and hierarchy allocation for data reuse optimization
Proceedings of the International Conference on Computer-Aided Design
Optimizing memory hierarchy allocation with loop transformations for high-level synthesis
Proceedings of the 49th Annual Design Automation Conference
Improving communication latency with the write-only architecture
Journal of Parallel and Distributed Computing
Polyhedral-based data reuse optimization for configurable computing
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Combining computation and communication optimizations in system synthesis for streaming applications
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Hi-index | 0.03 |
A nonlinear optimization framework is proposed in this paper to automate exploration of the design space consisting of data-reuse (buffering) decisions and loop-level parallelization, in the context of field-programmable-gate-array-targeted hardware compilation. Buffering frequently accessed data in on-chip memories can reduce off-chip memory accesses and open avenues for parallelization. However, the exploitation of both data reuse and parallelization is limited by the memory resources available on-chip. As a result, considering these two problems separately, e.g., first exploring data reuse and then exploring data-level parallelization, based on the data-reuse options determined in the first step, may not yield the performance-optimal designs for limited on-chip memory resources. We consider both problems at the same time, exposing the dependence between the two. We show that this combined problem can be formulated as a nonlinear program and further show that efficient solution techniques exist for this problem, based on recent advances in optimization of so-called geometric programming problems. The results from applying this framework to several real benchmarks implemented on a Xilinx device demonstrate that given different constraints on on-chip memory utilization, the corresponding performanceoptimal designs are automatically determined by the framework. We have also implemented designs determined by a two-stage optimization method that first explores data reuse and then explores parallelization on the same platform, and by comparison, the performance-optimal designs proposed by our framework are faster than the designs determined by the two-stage method by up to 5.7 times.