Loop Restructuring for Data I/O Minimization on Limited On-Chip Memory Embedded Processors

Authors:
Waibhav Tembe;Santosh Pande
Affiliations:
-;-
Venue:
IEEE Transactions on Computers
Year:
2002


Abstract

In this paper, we propose a framework for analyzing the flow of values and their reuse in loop nests to minimize data traffic under the constraints of limited on-chip memory capacity and dependences. Our analysis first fuses all possible loop nests intra-procedurally and then performs loop distribution. The analysis computes the closeness factor of two statements, a quantitative measure of the data traffic saved per unit of memory occupied when the statements are placed under the same loop nest rather than under different loop nests. We then develop a greedy algorithm that traverses the program dependence graph (PDG) and legally groups statements under the same loop nest to promote maximal reuse per unit of memory occupied. We implemented our framework in Petit, a tool for dependence analysis and loop transformations. We compared our method with one based on tiling the fused loop nest and one based on a greedy strategy that purely maximizes reuse. We show that our method works better than both of these strategies in most cases for processors such as the TMS320Cxx, which have a very limited amount of on-chip memory. For JPEG loops, the improvements in data I/O range from 10 to 30 percent over tiling and from 10 to 40 percent over maximal reuse.
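The idea of grouping statements greedily by closeness factor can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the exact formula or the PDG traversal order, so the sketch assumes that closeness is simply saved traffic divided by memory occupied, that legality is supplied as a set of statement pairs that dependences forbid from sharing a loop nest, and that each loop nest has a fixed on-chip memory budget. The names `closeness`, `greedy_group`, `illegal`, and `capacity` are all hypothetical.

```python
from itertools import combinations


def closeness(saved_traffic, memory_needed):
    """Assumed form of the closeness factor: traffic saved per unit of memory."""
    return saved_traffic / memory_needed if memory_needed > 0 else 0.0


def greedy_group(statements, pairs, illegal, capacity):
    """Greedily place statements under shared loop nests (illustrative only).

    statements: list of statement names
    pairs:      {frozenset({a, b}): (saved_traffic, memory_needed)}
    illegal:    set of frozenset({a, b}) that dependences forbid grouping
    capacity:   assumed on-chip memory budget available to one loop nest
    """
    group_of = {s: i for i, s in enumerate(statements)}
    members = {i: {s} for i, s in enumerate(statements)}
    used = {i: 0 for i in members}  # memory already committed per group

    # Visit candidate pairs from highest to lowest closeness factor.
    ranked = sorted(pairs.items(), key=lambda kv: closeness(*kv[1]), reverse=True)
    for pair, (_saved, mem) in ranked:
        a, b = tuple(pair)
        ga, gb = group_of[a], group_of[b]
        if ga == gb:
            continue  # already under the same loop nest
        merged = members[ga] | members[gb]
        # Legality: no forbidden pair may end up in the merged group.
        if any(frozenset(p) in illegal for p in combinations(merged, 2)):
            continue
        # Capacity: the merged group must fit in on-chip memory.
        if used[ga] + used[gb] + mem > capacity:
            continue
        members[ga] = merged
        used[ga] += used[gb] + mem
        for s in members[gb]:
            group_of[s] = ga
        del members[gb], used[gb]
    return sorted(sorted(g) for g in members.values())


stmts = ["S1", "S2", "S3"]
pairs = {
    frozenset({"S1", "S2"}): (100, 10),  # closeness 10.0
    frozenset({"S2", "S3"}): (120, 40),  # closeness 3.0
    frozenset({"S1", "S3"}): (30, 5),    # closeness 6.0
}
print(greedy_group(stmts, pairs, illegal=set(), capacity=12))
```

With these made-up numbers, the pair S2/S3 offers the largest raw reuse but needs the most memory, so ranking by closeness groups S1 with S2 first and the 12-unit budget then leaves S3 in its own loop nest. A strategy that ranked purely by saved traffic would consider S2/S3 first, which is the kind of difference the abstract's comparison against maximal reuse refers to.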