Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors

Authors:
Kathryn S. McKinley
Affiliations:
Department of Computer Science, University of Massachusetts, Amherst, MA
Venue:
ICS '94 Proceedings of the 8th international conference on Supercomputing
Year:
1994


Abstract

We present a parallel code generation algorithm for complete applications and a new experimental methodology that tests the efficacy of our approach. The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It also uses interprocedural analysis and transformations to improve the granularity of parallelism. Although the individual components of the algorithm have been published previously, their coordination is unique to this paper. For experimental validation, we do not attempt to parallelize “dusty deck” programs, where many have tried and failed. Instead, we collect programs whose users tried to achieve excellent parallel performance. We apply our optimizations to sequential versions of these programs; that is, the compiler must use its own analysis and algorithms to parallelize each program and cannot rely on user assertions that, for example, a loop is parallel. By this metric, our algorithm matches or improves on hand-coded parallel programs on shared-memory, bus-based parallel machines for eight of the nine programs in our test suite.
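
As a rough illustration of the false sharing the abstract mentions, the following C sketch counts how many cache lines of an array would be written by more than one processor under two ways of dividing loop iterations. It is not the paper's algorithm: the function names, the block and cyclic scheduling policies, and the cache-line model (a fixed number of array elements per line, array aligned to a line boundary) are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: owner of iteration i when n iterations are split
 * into p contiguous blocks of ceil(n / p) iterations each. */
static size_t block_owner(size_t i, size_t n, size_t p) {
    size_t chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Hypothetical helper: owner of iteration i when iterations are dealt
 * out round-robin to p processors. */
static size_t cyclic_owner(size_t i, size_t p) {
    return i % p;
}

/* Count the cache lines of an n-element array that are written by more
 * than one processor, assuming iteration i writes element i and the
 * array starts on a cache-line boundary. A nonzero result means the
 * schedule can cause false sharing under this simplified model. */
size_t shared_lines(size_t n, size_t p, size_t elems_per_line, int cyclic) {
    assert(n > 0 && p > 0 && elems_per_line > 0);
    size_t shared = 0;
    for (size_t start = 0; start < n; start += elems_per_line) {
        size_t end = start + elems_per_line < n ? start + elems_per_line : n;
        size_t first = cyclic ? cyclic_owner(start, p)
                              : block_owner(start, n, p);
        for (size_t i = start + 1; i < end; i++) {
            size_t owner = cyclic ? cyclic_owner(i, p)
                                  : block_owner(i, n, p);
            if (owner != first) {   /* two processors write this line */
                shared++;
                break;
            }
        }
    }
    return shared;
}
```

For example, with 64 elements, 4 processors, and 8 elements per line, `shared_lines(64, 4, 8, 1)` reports that all 8 lines are shared under cyclic scheduling, while `shared_lines(64, 4, 8, 0)` reports none under block scheduling. When the block size is not a multiple of the line size (say, 60 elements), a few lines at block boundaries are still shared, which is why the abstract's wording "reducing or eliminating" fits this simplified picture.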