Loop optimization for a class of memory-constrained computations

Authors:
D. Cociorva; J. W. Wilkins; C. Lam; G. Baumgartner; J. Ramanujam; P. Sadayappan
Affiliations:
Dept. of Physics, The Ohio State University, Columbus, OH; Dept. of Physics, The Ohio State University, Columbus, OH; Dept. of Comp. & Info. Sci., The Ohio State University, Columbus, OH; Dept. of Comp. & Info. Sci., The Ohio State University, Columbus, OH; Dept. of Elec. & Comp. Engr., Louisiana State University, Baton Rouge, LA; Dept. of Comp. & Info. Sci., The Ohio State University, Columbus, OH
Venue:
ICS '01 Proceedings of the 15th international conference on Supercomputing
Year:
2001

Abstract

Compute-intensive multi-dimensional summations involving products of several arrays arise in modeling the electronic structure of materials. Sometimes a computation admits several alternative formulations that represent different space-time trade-offs. Computing and storing intermediate arrays can reduce the number of arithmetic operations, but those intermediate arrays may be prohibitively large. Loop fusion can reduce the memory requirements, but it may impede the tiling needed to minimize memory access costs. This paper develops an integrated model that combines loop tiling, to enhance data reuse, with loop fusion, to reduce the memory needed for intermediate temporary arrays. An algorithm is presented that selects tile sizes and the loops to fuse, with the objective of minimizing cache misses while keeping total memory usage within a given limit. Experimental results demonstrate the effectiveness of the combined loop tiling and fusion transformations produced by the framework.
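
The memory effect of loop fusion mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm or cost model; the arrays A, B, C, the two-step summation, and the function names are hypothetical, chosen only to show how fusing more loops between the producer and consumer of an intermediate array shrinks that array.

```python
# Hypothetical two-step summation (an assumption for illustration):
#   T[i][j] = sum_k A[i][k] * B[k][j]
#   R[i]    = sum_j T[i][j] * C[j]
# Each function returns (R, number of intermediate elements held at once).

def unfused(A, B, C):
    """No fusion: the full n-by-p intermediate T is stored."""
    n, m, p = len(A), len(B), len(C)
    T = [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
         for i in range(n)]
    R = [sum(T[i][j] * C[j] for j in range(p)) for i in range(n)]
    return R, n * p


def fused_i(A, B, C):
    """Fuse the i loop only: T shrinks to one row of length p."""
    n, m, p = len(A), len(B), len(C)
    R = []
    for i in range(n):
        row = [sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
        R.append(sum(row[j] * C[j] for j in range(p)))
    return R, p


def fused_ij(A, B, C):
    """Fuse both the i and j loops: T shrinks to a single scalar."""
    n, m, p = len(A), len(B), len(C)
    R = [0] * n
    for i in range(n):
        for j in range(p):
            t = 0  # the only live element of the intermediate
            for k in range(m):
                t += A[i][k] * B[k][j]
            R[i] += t * C[j]
    return R, 1


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = [1, 2]
    for variant in (unfused, fused_i, fused_ij):
        result, temp_size = variant(A, B, C)
        print(variant.__name__, result, "intermediate elements:", temp_size)
```

All three variants compute the same result with the same number of arithmetic operations; only the storage for the intermediate changes. The sketch does not model the other half of the trade-off the abstract describes, namely that fusing loops constrains how the remaining loops can be tiled for cache reuse.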