Automatic memory partitioning and scheduling for throughput and power optimization

Authors:
Jason Cong;Wei Jiang;Bin Liu;Yi Zou
Affiliations:
University of California, Los Angeles, CA;University of California, Los Angeles, CA;University of California, Los Angeles, CA;University of California, Los Angeles, CA
Venue:
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Year:
2011

Abstract

The memory bottleneck has become a limiting factor in satisfying the explosive demands on performance and cost in modern embedded system design. Computation kernels selected for acceleration are usually captured by nested loops, which are optimized by state-of-the-art techniques such as loop tiling and loop pipelining. However, memory bandwidth bottlenecks prevent designs from reaching the optimal throughput that the available parallelism would allow. In this paper we present an automatic memory partitioning technique that can efficiently improve throughput and reduce energy consumption of pipelined loop kernels for given throughput constraints and platform requirements. Our proposed algorithm can also handle general array accesses beyond affine array references. Our partitioning scheme consists of two steps. The first step considers cycle-accurate scheduling information to meet the hard constraints on memory bandwidth requirements, specifically for synchronized hardware designs. An ILP formulation is proposed to solve the memory partitioning and scheduling problem optimally for small designs, followed by a heuristic algorithm that is more scalable and equally effective for solving large-scale problems. Experimental results show an average 6× throughput improvement on a set of real-world designs with a moderate area increase (about 45% on average), since fewer resource-sharing opportunities exist at the higher throughput of the optimized designs. The second step further partitions the memory banks to reduce the dynamic power consumption of the final design. In contrast to previous approaches, our technique can statically compute memory access frequencies in polynomial time with little or no profiling. Experimental results show about 30% power reduction on the same set of benchmarks.
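The bandwidth argument in the abstract can be illustrated with a small sketch. It is not the paper's ILP formulation or heuristic. It assumes a simple cyclic partitioning (bank = address mod N), a configurable number of ports per bank, and a loop body that reads one array at constant offsets from the loop index. The function names `min_initiation_interval` and `min_cyclic_banks` are hypothetical and chosen only for this example.

```python
from collections import Counter


def min_initiation_interval(offsets, num_banks, ports_per_bank=1):
    """Lower bound on the pipeline initiation interval imposed by memory ports.

    Assumption: each iteration i reads a[i + d] for every d in `offsets`,
    and element k of the array is stored in bank k % num_banks.
    Because every access shares the same i, only d % num_banks matters.
    """
    load = Counter(d % num_banks for d in offsets)
    worst = max(load.values())  # accesses hitting the busiest bank
    return -(-worst // ports_per_bank)  # ceiling division


def min_cyclic_banks(offsets, max_banks=64):
    """Smallest bank count giving every access its own bank, or None.

    Brute-force search for illustration only; not the paper's algorithm.
    """
    for n in range(1, max_banks + 1):
        if len({d % n for d in offsets}) == len(offsets):
            return n
    return None


if __name__ == "__main__":
    stencil = [0, 1, 2]  # a[i], a[i+1], a[i+2] read in one iteration
    print(min_initiation_interval(stencil, num_banks=1))
    print(min_initiation_interval(stencil, num_banks=3))
    print(min_cyclic_banks([0, 2, 4]))
```

With a single single-port bank, the three reads of the stencil must be serialized, so the loop cannot start a new iteration every cycle. Spreading the array over three banks removes that constraint in this simplified model. Stride-2 offsets such as `[0, 2, 4]` show that the bank count matters as well: two banks would map all three reads to the same bank.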