Optimizing for parallelism and data locality

Authors:
Ken Kennedy;Kathryn S. McKinley
Affiliations:
-;-
Venue:
ICS '92 Proceedings of the 6th international conference on Supercomputing
Year:
1992

Citing 22
Cited 56

Automatic translation of FORTRAN programs to vector form

ACM Transactions on Programming Languages and Systems (TOPLAS)
A global approach to detection of parallelism

A global approach to detection of parallelism
Automatic decomposition of scientific programs for parallel execution

POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Data dependence and its application to parallel processing

International Journal of Parallel Programming
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A theory of loop permutations

Selected papers of the second workshop on Languages and compilers for parallel computing
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
A static performance estimator in the Fortran D programming system

Languages, compilers and run-time environments for distributed memory machines
Automatic and interactive parallelization

Automatic and interactive parallelization
Supercomputer performance evaluation and the Perfect Benchmarks

ICS '90 Proceedings of the 4th international conference on Supercomputing
The parallel execution of DO loops

Communications of the ACM
Automatic loop interchange

SIGPLAN '84 Proceedings of the 1984 SIGPLAN symposium on Compiler construction
Optimizing Supercompilers for Supercomputers

Optimizing Supercompilers for Supercomputers
Dependence graphs and compiler optimizations

POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Improving the performance of virtual memory computers.

Improving the performance of virtual memory computers.
Software methods for improvement of cache performance on supercomputer applications

Software methods for improvement of cache performance on supercomputer applications

Evaluation of compiler optimizations for Fortran D on MIMD distributed memory machines

ICS '92 Proceedings of the 6th international conference on Supercomputing
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Managing pages in shared virtual memory systems: getting the compiler into the game

ICS '93 Proceedings of the 7th international conference on Supercomputing
A static parameter based performance prediction tool for parallel programs

ICS '93 Proceedings of the 7th international conference on Supercomputing
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Unified compilation of Fortran 77D and 90D

ACM Letters on Programming Languages and Systems (LOPLAS)
Exploiting the parallelism available in loops

Computer
Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors

ICS '94 Proceedings of the 8th international conference on Supercomputing
Using virtual lines to enhance locality exploitation

ICS '94 Proceedings of the 8th international conference on Supercomputing
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Reducing false sharing on shared memory multiprocessors through compile time data transformations

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Unified compilation techniques for shared and distributed address space machines

ICS '95 Proceedings of the 9th international conference on Supercomputing
A compiler optimization to reduce execution time of loop nest

ACM SIGARCH Computer Architecture News
Improving data locality with loop transformations

ACM Transactions on Programming Languages and Systems (TOPLAS)
The influence of caches on the performance of heaps

Journal of Experimental Algorithmics (JEA)
Determining the idle time of a tiling

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Maximizing parallelism and minimizing synchronization with affine transforms

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The implementation and evaluation of fusion and contraction in array languages

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
A general algorithm for tiling the register level

ICS '98 Proceedings of the 12th international conference on Supercomputing
A Compiler Optimization Algorithm for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
An affine partitioning algorithm to maximize parallelism and minimize communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests

Proceedings of the 14th international conference on Supercomputing
Hoard: a scalable memory allocator for multithreaded applications

ACM SIGPLAN Notices
A compiler technique for improving whole-program locality

POPL '01 Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Loop optimization for a class of memory-constrained computations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Reducing memory requirements of nested loops for embedded systems

Proceedings of the 38th annual Design Automation Conference
Hoard: a scalable memory allocator for multithreaded applications

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Data Relation Vectors: A New Abstraction for Data Optimizations

IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests

International Journal of Parallel Programming
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
Compile Time Barrier Synchronization Minimization

IEEE Transactions on Parallel and Distributed Systems
A Simulation Study of Decoupled Vector Architectures

The Journal of Supercomputing
Quantifying the Multi-Level Nature of Tiling Interactions

International Journal of Parallel Programming
Effects of Loop Fusion and Statement Migration on the Speedup of Vector Multiprocessors

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
On the Parallel Execution Time of Tiled Loops

IEEE Transactions on Parallel and Distributed Systems
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Decoupled vector architectures

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Efficient Data Parallel Algorithms for Multidimensional Array Operations Based on the EKMR Scheme for Distributed Memory Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Transforming Complex Loop Nests for Locality

The Journal of Supercomputing
Scheduling Reusable Instructions for Power Reduction

Proceedings of the conference on Design, automation and test in Europe - Volume 1
Quasidynamic Layout Optimizations for Improving Data Locality

IEEE Transactions on Parallel and Distributed Systems
Line Size Adaptivity Analysis of Parameterized Loop Nests for Direct Mapped Data Cache

IEEE Transactions on Computers
Maximizing data reuse for minimizing memory space requirements and execution cycles

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Multi-compilation: capturing interactions among concurrently-executing applications

Proceedings of the 3rd conference on Computing frontiers
An extended OpenMP targeting on the hybrid architecture of SMP-cluster

ACST'06 Proceedings of the 2nd IASTED international conference on Advances in computer science and technology
Single-dimension software pipelining for multidimensional loops

ACM Transactions on Architecture and Code Optimization (TACO)
$P$^$3$$T+$: A performance estimator for distributed and parallel programs

Scientific Programming
Guidance of Loop Ordering for Reduced Memory Usage in Signal Processing Applications

Journal of Signal Processing Systems
Compiler directed parallelization of loops in scale for shared-memory multiprocessors

ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
On the interaction of tiling and automatic parallelization

IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
A programming language interface to describe transformations and code generation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Loop Distribution and Fusion with Timing and Code Size Optimization

Journal of Signal Processing Systems
An ILP-Based approach to locality optimization

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers

Quantified Score

Hi-index	0.00

Visualization

Abstract

Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the trade-offs between effectively utilizing parallelism and memory hierarchy on shared-memory multiprocessors. We present a simple, but surprisingly accurate, memory model to determine cache line reuse from both multiple accesses to the same memory location and from consecutive memory access. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results.