Parallelizing load/stores on dual-bank memory embedded processors

  • Authors:
  • Xiaotong Zhuang; Santosh Pande

  • Affiliations:
  • Georgia Institute of Technology, Atlanta, Georgia (both authors)

  • Venue:
  • ACM Transactions on Embedded Computing Systems (TECS)
  • Year:
  • 2006

Abstract

Many modern embedded processors, such as DSPs, support partitioned memory banks (also called X-Y memory or dual-bank memory) along with parallel load/store instructions to achieve higher code density and performance. To utilize the parallel load/store instructions effectively, the compiler must partition the memory-resident values and assign them to the X or Y bank. This paper gives a post-register-allocation solution that merges the generated load/store instructions into their parallel counterparts. Simultaneously, our framework performs allocation of values to the X or Y memory bank. We first remove as many load/stores and register-to-register moves as possible through the iterated-coalescing register allocator of Appel and George [1996]. We then attempt to parallelize the generated load/stores using a multipass approach. The basic phase of our approach attempts to merge load/stores without duplication or web splitting. We model this problem as a graph-coloring problem in which each value is colored either X or Y. We then construct a motion scheduling graph (MSG) based on the range of motion of each load/store instruction; the MSG captures which instructions could potentially be merged. We propose a notion of pseudo-fixed boundaries so that load/store motion is less constrained by register dependencies. We prove that the coloring problem for the MSG is NP-complete and solve it with two heuristic algorithms of differing complexity. We then propose a two-level iterative process that applies instruction duplication, variable duplication, web splitting, and local conflict elimination to merge the remaining load/stores. Finally, we clean up certain multiply-aliased load/stores. To improve performance further, we incorporate profiling information into each stage, with corresponding modifications to the algorithms. We show that our framework parallelizes a large number of load/stores without much growth in the data and code segments. The average speedup from our optimization pass reaches roughly 13% without profile information and 17% with it; the average code and data segment growth is kept within 13%.
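To make the bank-assignment formulation concrete, below is a minimal sketch (in Python, and not the paper's actual algorithm or data structures) of the coloring problem the abstract describes: values are nodes, and a weighted edge marks a pair of load/stores that could be merged into one parallel instruction if its endpoints land in different banks. Maximizing merged pairs is then a weighted max-cut, which a greedy pass plus one local-improvement sweep approximates. The graph encoding, function name, and example data are all illustrative assumptions.

    # Hypothetical sketch of X/Y bank assignment as weighted max-cut.
    # An edge (u, v, w) means w load/store pairs touching values u and v
    # could merge into a parallel load/store if u and v sit in different
    # banks. This is an illustration, not the paper's algorithm.

    from collections import defaultdict

    def assign_banks(values, edges):
        """Greedy X/Y assignment; edges: iterable of (u, v, weight),
        with every endpoint appearing in values."""
        adj = defaultdict(list)
        for u, v, w in edges:
            adj[u].append((v, w))
            adj[v].append((u, w))

        bank = {}
        # Greedy pass: place each value opposite its heavier placed side.
        for val in values:
            gain_x = sum(w for nbr, w in adj[val] if bank.get(nbr) == 'Y')
            gain_y = sum(w for nbr, w in adj[val] if bank.get(nbr) == 'X')
            bank[val] = 'X' if gain_x >= gain_y else 'Y'

        # Local improvement: flip a value's bank if that merges more pairs.
        for val in values:
            same = sum(w for nbr, w in adj[val] if bank[nbr] == bank[val])
            diff = sum(w for nbr, w in adj[val] if bank[nbr] != bank[val])
            if same > diff:
                bank[val] = 'Y' if bank[val] == 'X' else 'X'
        return bank

    if __name__ == '__main__':
        # a-b and c-d are strong merge candidates; a-c is a weaker one.
        vals = ['a', 'b', 'c', 'd']
        es = [('a', 'b', 3), ('c', 'd', 2), ('a', 'c', 1)]
        print(assign_banks(vals, es))  # e.g. {'a': 'X', 'b': 'Y', 'c': 'Y', 'd': 'X'}

On this toy input the heuristic cuts all three edges, i.e., every candidate pair can be merged; the paper's NP-completeness result says no polynomial algorithm is expected to achieve this in general, which is why it resorts to heuristics of differing cost.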