A Compiler Optimization Algorithm for Shared-Memory Multiprocessors

Authors:
Kathryn S. McKinley
Affiliations:
Univ. of Massachusetts, Amherst
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1998


Abstract

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were initially hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
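As a rough illustration of how a simple cache model can drive loop ordering, the sketch below estimates how many cache lines a loop nest touches when each candidate loop is placed innermost, then orders the loops so the cheapest one ends up innermost. It is not the paper's algorithm, and the abstract does not specify its cache model. The function names, the column-major (Fortran-style) layout, the uniform trip count, and the matrix-multiply references are all assumptions made for this example.

```python
# Hypothetical sketch of a cache-line cost model for ranking loop orders.
# Assumes column-major arrays, unit-step loops, and one trip count for all loops.

# Each reference is a tuple of subscript loop variables, e.g. C(i,j) -> ("i", "j").
# Assumed example nest: C(i,j) = C(i,j) + A(i,k) * B(k,j)
REFS = [("i", "j"), ("i", "k"), ("k", "j")]


def ref_cost(subscripts, loop, trip, line_elems):
    """Estimated cache lines one reference touches when `loop` is innermost."""
    if loop not in subscripts:
        return 1  # loop-invariant: the same element on every iteration
    if subscripts[0] == loop and loop not in subscripts[1:]:
        return trip / line_elems  # unit stride down a column: one line per line_elems
    return trip  # strided access: assume a new line on every iteration


def loop_cost(refs, loop, loops, trip, line_elems):
    """Cost of the whole nest with `loop` innermost."""
    inner = sum(ref_cost(r, loop, trip, line_elems) for r in refs)
    return inner * trip ** (len(loops) - 1)  # repeated by every outer iteration


def rank_loops(refs, loops, trip, line_elems):
    """Return loops ordered outermost to innermost, plus the per-loop costs."""
    costs = {l: loop_cost(refs, l, loops, trip, line_elems) for l in loops}
    order = sorted(loops, key=lambda l: costs[l], reverse=True)
    return order, costs


if __name__ == "__main__":
    order, costs = rank_loops(REFS, ["i", "j", "k"], trip=512, line_elems=4)
    print("outermost to innermost:", order)
```

Under these assumptions the `i` loop is cheapest as the innermost loop, because it walks down columns of `C` and `A` while `B(k,j)` stays fixed. A parallelizing compiler would still need dependence analysis to confirm that the reordering is legal and to decide which outer loop to run in parallel. This sketch covers neither step.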