The program dependence graph and its use in optimization
ACM Transactions on Programming Languages and Systems (TOPLAS)
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
A practical data flow framework for array reference analysis and its use in optimizations
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Compiling for numa parallel machines
Compiling for numa parallel machines
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Array data flow analysis for load-store optimizations in fine-grain architectures
International Journal of Parallel Programming - Special issue: selected papers from the eighth international workshop on languages and compilers for parallel computing
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Proceedings of the 14th international conference on Supercomputing
Loop Transformations for Restructuring Compilers: The Foundations
Loop Transformations for Restructuring Compilers: The Foundations
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Quantifying the Multi-Level Nature of Tiling Interactions
International Journal of Parallel Programming
DSP Processors Hit the Mainstream
Computer
Collective Loop Fusion for Array Contraction
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
An Efficient Data Partitioning Method for Limited Memory Embedded Systems
LCTES '98 Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems
Compiler Optimizations for Real Time Execution of Loops on Limited Memory Embedded Systems
RTSS '98 Proceedings of the IEEE Real-Time Systems Symposium
Memory Organization for Improved Data Cache Performance in Embedded Processors
ISSS '96 Proceedings of the 9th international symposium on System synthesis
Automatic Blocking of Nested Loops
Automatic Blocking of Nested Loops
Improving Data Locality by Array Contraction
IEEE Transactions on Computers
Power optimization with power islands synthesis
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Hi-index | 14.98 |
In this paper, we propose a framework for analyzing the flow of values and their reuse in loop nests to minimize data traffic under the constraints of limited on-chip memory capacity and dependences. Our analysis first undertakes fusion of possible loop nests intra-procedurally and then performs loop distribution. The analysis discovers the closeness factor of two statements which is a quantitative measure of data traffic saved per unit memory occupied if the statements, were under the same loop nest over the case where they are under different loop nests. We then develop a greedy algorithm which traverses the program dependence graph (PDG) to group statements together under the same loop nest legally to promote maximal reuse per unit of memory occupied. We implemented our framework in Petit, a tool for dependence analysis and loop transformations. We compared our method with one based on tiling of fused loop nest and one based on a greedy strategy to purely maximize reuse. We show that our methods work better than both of these strategies in most cases for processors such as TMS320Cxx, which have a very limited amount of on-chip memory. The improvements in data I/O range from 10 to 30 percent over tiling and from 10 to 40 percent over maximal reuse for JPEG loops.