Combining data reuse with data-level parallelization for FPGA-targeted hardware compilation: a geometric programming framework

Authors:
Qiang Liu;George A. Constantinides;Konstantinos Masselos;Peter Y. K. Cheung
Affiliations:
Department of Electrical and Electronic Engineering, Imperial College London, London, UK;Department of Electrical and Electronic Engineering, Imperial College London, London, UK;Department of Computer Science and Technology, University of Peloponnese, Tripolis, Greece;Department of Electrical and Electronic Engineering, Imperial College London, London, UK
Venue:
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Year:
2009

Abstract

A nonlinear optimization framework is proposed in this paper to automate the exploration of a design space consisting of data-reuse (buffering) decisions and loop-level parallelization, in the context of field-programmable-gate-array (FPGA)-targeted hardware compilation. Buffering frequently accessed data in on-chip memories can reduce off-chip memory accesses and open avenues for parallelization. However, the exploitation of both data reuse and parallelization is limited by the on-chip memory resources available. As a result, treating the two problems separately, e.g., first exploring data reuse and then exploring data-level parallelization on top of the data-reuse options chosen in the first step, may not yield performance-optimal designs when on-chip memory is limited. We consider both problems simultaneously, exposing the dependence between them. We show that this combined problem can be formulated as a nonlinear program, and further show that efficient solution techniques exist for it, based on recent advances in the optimization of so-called geometric programming problems. Results from applying this framework to several real benchmarks implemented on a Xilinx device demonstrate that, given different constraints on on-chip memory utilization, the framework automatically determines the corresponding performance-optimal designs. We have also implemented, on the same platform, designs determined by a two-stage optimization method that first explores data reuse and then explores parallelization; by comparison, the performance-optimal designs proposed by our framework are up to 5.7 times faster than those determined by the two-stage method.
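The abstract's central point is that fixing data reuse first can leave too little on-chip memory for parallelization. A toy model makes this concrete. The sketch below is not the paper's geometric-programming formulation. It exhaustively enumerates a tiny, hypothetical design space and compares a two-stage search with a joint search under the same memory limit. Everything in it is invented for illustration: the three buffering options, the cycle counts, the set of parallelism factors, and the rule that each parallel processing element (PE) needs its own buffer copy.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical cost model, NOT the paper's; all numbers are illustrative only.
OFFCHIP_CYCLES = 4          # cycles per off-chip access (single, serialized port)
COMPUTE_CYCLES = 800_000    # total on-chip work, split evenly across PEs
PAR_FACTORS = (1, 2, 4, 8, 16)


@dataclass(frozen=True)
class ReuseOption:
    name: str
    buffer_words: int       # on-chip buffer needed per processing element
    offchip_accesses: int   # off-chip accesses remaining with this buffering


OPTIONS = (
    ReuseOption("none", 0, 100_000),
    ReuseOption("line", 64, 20_000),
    ReuseOption("frame", 1024, 1_000),
)


def memory(opt, p):
    # Assumption: each of the p parallel PEs holds its own copy of the buffer.
    return opt.buffer_words * p


def cycles(opt, p):
    # Off-chip traffic is serialized; on-chip compute scales with p.
    return opt.offchip_accesses * OFFCHIP_CYCLES + COMPUTE_CYCLES // p


def two_stage(mem_limit):
    # Stage 1: fewest off-chip accesses among options that fit with p = 1.
    opt = min((o for o in OPTIONS if memory(o, 1) <= mem_limit),
              key=lambda o: o.offchip_accesses)
    # Stage 2: largest parallelism that still fits, given the fixed buffer.
    p = max(f for f in PAR_FACTORS if memory(opt, f) <= mem_limit)
    return opt, p, cycles(opt, p)


def combined(mem_limit):
    # Joint search over all (reuse option, parallelism) pairs that fit.
    feasible = [(o, p) for o, p in product(OPTIONS, PAR_FACTORS)
                if memory(o, p) <= mem_limit]
    opt, p = min(feasible, key=lambda op: cycles(*op))
    return opt, p, cycles(opt, p)


if __name__ == "__main__":
    for label, search in (("two-stage", two_stage), ("combined", combined)):
        opt, p, c = search(1024)
        print(f"{label}: {opt.name}, p={p}, {c} cycles")
```

With a 1024-word budget in this made-up model, the two-stage search commits to the largest buffer ("frame") and is left with p = 1 (804,000 cycles). The joint search instead picks the smaller "line" buffer replicated 16 times (130,000 cycles). Brute-force enumeration like this only scales to tiny spaces. The paper's stated approach is to cast the combined problem as a geometric program so that it can be solved efficiently.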