Data locality enhancement by memory reduction

Authors:
Yonghong Song, Rong Xu, Cheng Wang, Zhiyuan Li
Affiliations:
Sun Microsystems, Inc., 901 San Antonio Rd., Palo Alto, CA (Yonghong Song); Department of Computer Sciences, Purdue University, West Lafayette, IN (Rong Xu, Cheng Wang, Zhiyuan Li)
Venue:
ICS '01 Proceedings of the 15th international conference on Supercomputing
Year:
2001

Abstract

In this paper, we propose memory reduction as a new approach to data locality enhancement. Under this approach, the compiler reduces the size of the data repeatedly referenced in a collection of nested loops, so that between reuses the data are more likely to remain in higher-speed memory devices such as the cache. Specifically, we present an optimal algorithm that combines loop shifting, loop fusion, and array contraction to reduce the temporary array storage required to execute a collection of loops. When applied to 20 benchmark programs, our technique reduces the memory requirement, counting both data and code, by 51% on average. The transformed programs achieve an average speedup of 1.40, owing to the reduced footprint and the resulting improvement in data locality.
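To make the combination of loop shifting, loop fusion, and array contraction concrete, here is a minimal hand-worked sketch in Python. It is not the paper's optimal algorithm, which chooses shift amounts for a whole collection of loops; it only applies the three transformations by hand to two hypothetical loops. The function names `original` and `fused_contracted`, the temporary `t`, and the loop bodies are illustrative assumptions. Because the consumer loop reads `t[i + 1]`, fusing the loops unchanged would read `t[i + 1]` before it is written. Shifting the consumer back by one iteration makes fusion legal, after which each element of `t` is dead one iteration after it is produced, so the `n`-element temporary contracts to two scalars.

```python
from typing import List


def original(a: List[float]) -> List[float]:
    """Two separate loops; the temporary array t holds len(a) elements."""
    n = len(a)
    t = [0.0] * n
    for i in range(n):              # loop 1: producer of t
        t[i] = 2.0 * a[i]
    b = [0.0] * max(n - 1, 0)
    for i in range(n - 1):          # loop 2: consumer reads t[i] and t[i + 1]
        b[i] = t[i] + t[i + 1]
    return b


def fused_contracted(a: List[float]) -> List[float]:
    """Hypothetical hand transformation: loop 2 is shifted by one iteration,
    fused with loop 1, and t is contracted to two scalars (prev, cur)."""
    n = len(a)
    b = [0.0] * max(n - 1, 0)
    if n == 0:
        return b
    prev = 2.0 * a[0]               # t[0]: prologue created by the shift
    for i in range(1, n):
        cur = 2.0 * a[i]            # loop 1, iteration i: produces t[i]
        b[i - 1] = prev + cur       # loop 2, iteration i - 1 (shifted by one)
        prev = cur                  # t[i] is reused as t[i - 1] next iteration
    return b
```

For example, `original([1.0, 2.0, 3.0, 4.0])` and `fused_contracted([1.0, 2.0, 3.0, 4.0])` both return `[6.0, 10.0, 14.0]`, but the second version keeps at most two values of `t` live instead of four. With more loops and longer dependence distances, the shift amounts interact, which is the problem the paper's algorithm addresses.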