Global optimizations for parallelism and locality on scalable parallel machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Compile-Time Techniques for Data Distribution in Distributed Memory Machines
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
R. Barua, W. Lee, S. Amarasinghe and A. Agarwal
HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing
Automatic memory partitioning and scheduling for throughput and power optimization
Proceedings of the 2009 International Conference on Computer-Aided Design
Automatic memory partitioning and scheduling for throughput and power optimization
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Customizable Domain-Specific Computing
IEEE Design & Test
High-Level Synthesis for FPGAs: From Prototyping to Deployment
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Improving high level synthesis optimization opportunity through polyhedral transformations
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Memory partitioning for multidimensional arrays in high-level synthesis
Proceedings of the 50th Annual Design Automation Conference
Theory and algorithm for generalized memory partitioning in high-level synthesis
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays
Hi-index | 0.00 |
Achieving optimal throughput by extracting parallelism in behavioral synthesis often exaggerates memory bottleneck issues. Data partitioning is an important technique for increasing memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. In this paper we present a vertical memory partitioning and scheduling algorithm that can generate a valid partition scheme for arbitrary affine memory inputs. It does this by arranging non-conflicting memory accesses across the border of loop iterations. A mixed memory partitioning and scheduling algorithm is also proposed to combine the advantages of the vertical and other state-of-art algorithms. A set of theorems is provided as criteria for selecting a valid partitioning scheme. This is followed by an optimal and scalable memory scheduling algorithm. By utilizing the property of constant strides between memory addresses in successive loop iterations, an address translation optimization technique for an arbitrary partition factor is proposed to improve performance, area and energy efficiency. Experimental results show that on a set of real-world medical image processing kernels, the proposed mixed algorithm with address translation optimization can gain speed-up, area reduction and power savings of 15.8%, 36% and 32.4% respectively, compared to the state-of-art memory partitioning algorithm.