A Framework for Loop Distribution on Limited On-Chip Memory Processors

Authors:
Lei Wang;Waibhav Tembe;Santosh Pande
Venue:
CC '00 Proceedings of the 9th International Conference on Compiler Construction
Year:
2000

Abstract

This work proposes a framework for analyzing the flow of values and their reuse in loop nests, with the goal of minimizing data traffic under the constraints of limited on-chip memory capacity and data dependences. The analysis first fuses loop nests intra-procedurally wherever possible and then performs loop distribution. It computes the closeness factor of two statements, a quantitative measure of the data traffic saved per unit of memory occupied when the statements are placed under the same loop nest rather than under different loop nests. We then develop a greedy algorithm that traverses the program dependence graph (PDG) to group statements legally under the same loop nest. The main idea of the greedy algorithm is to transitively build a group of statements that can legally execute under a given loop nest and that leads to minimum data traffic. We implemented the framework in Petit, a tool for dependence analysis and loop transformations. Our approach eliminates as much as 30% of the data traffic in some cases, improving overall completion time by 23.33% for processors such as TI's TMS320C5x.
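
The grouping idea in the abstract can be illustrated with a small Python sketch. It is not the paper's algorithm: the abstract does not give the exact definition of the closeness factor or the PDG traversal, so the code assumes a simplified stand-in. Here, closeness is taken to be the size of the arrays two statements share divided by the size of all arrays they touch, legality is reduced to a hypothetical caller-supplied set `illegal_pairs` (in the paper, legality comes from the dependences in the PDG), and the names `closeness`, `greedy_groups`, `arrays`, `size`, and `capacity` are all illustrative.

```python
from itertools import combinations

def closeness(a, b, arrays, size):
    """Assumed stand-in for the closeness factor: words of data shared by
    statements a and b per word of on-chip memory their arrays occupy."""
    shared = arrays[a] & arrays[b]
    occupied = arrays[a] | arrays[b]
    saved = sum(size[x] for x in shared)
    used = sum(size[x] for x in occupied)
    return saved / used if used else 0.0

def greedy_groups(stmts, arrays, size, capacity, illegal_pairs=frozenset()):
    """Greedily merge statement groups, closest pairs first, as long as the
    merge is legal (hypothetical check) and the merged footprint fits."""
    group_of = {s: frozenset([s]) for s in stmts}
    pairs = sorted(combinations(stmts, 2),
                   key=lambda p: closeness(p[0], p[1], arrays, size),
                   reverse=True)
    for a, b in pairs:
        ga, gb = group_of[a], group_of[b]
        if ga == gb or closeness(a, b, arrays, size) == 0.0:
            continue  # already together, or nothing to gain from merging
        merged = ga | gb
        touched = set().union(*(arrays[s] for s in merged))
        footprint = sum(size[x] for x in touched)
        legal = not any(frozenset((x, y)) in illegal_pairs
                        for x in ga for y in gb)
        if legal and footprint <= capacity:
            for s in merged:
                group_of[s] = merged
    # Report each group once, in the order its first statement appears.
    result = []
    for s in stmts:
        if group_of[s] not in result:
            result.append(group_of[s])
    return [sorted(g) for g in result]

# Illustrative data: S1 and S2 share array B; S3 touches unrelated data.
stmts = ["S1", "S2", "S3"]
arrays = {"S1": {"A", "B"}, "S2": {"B", "C"}, "S3": {"D"}}
size = {"A": 100, "B": 100, "C": 100, "D": 100}

print(greedy_groups(stmts, arrays, size, capacity=300))
```

With a capacity of 300 words, S1 and S2 end up under one loop nest because their combined arrays just fit, while S3 stays separate. Shrinking the capacity to 200, or marking the pair (S1, S2) as illegal, keeps all three statements in separate nests, which mirrors the trade-off between reuse, memory capacity, and dependences that the abstract describes.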