High performance Fortran compilation techniques for parallelizing scientific codes

Authors:
Vikram Adve;Guohua Jin;John Mellor-Crummey;Qing Yi
Affiliations:
Rice University, Houston, TX;Rice University, Houston, TX;Rice University, Houston, TX;Rice University, Houston, TX
Venue:
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Year:
1998

Citing 9
Cited 18

The high performance Fortran handbook

The high performance Fortran handbook
Preliminary experiences with the Fortran D compiler

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
A scalable implementation of the NAS Parallel Benchmark BT on distributed memory systems

IBM Systems Journal
Performance measurement, visualization and modeling of parallel and distributed programs using the AIMS toolkit

Software—Practice & Experience
Computer aided parallelisation tools (CAPTools)—conceptual overview and performance on the parallelisation of structured mesh codes

Parallel Computing
Global communication analysis and optimization

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Using integer sets for data-parallel program analysis and optimization

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
The Paradigm Compiler for Distributed-Memory Multicomputers

Computer
Program Analysis of Overlap Area Usage in Self-Similar Parallel Programs

LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing

A comparison of automatic parallelization tools/compilers on the SGI origin 2000

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Automatic data and computation decomposition on distributed memory parallel computers

ACM Transactions on Programming Languages and Systems (TOPLAS)
Increasing temporal locality with skewing and recursive blocking

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Compiler Synthesis of Task Graphs for Parallel Program Performance Prediction

LCPC '00 Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing-Revised Papers
An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Data-Parallel Compiler Support for Multipartitioning

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Compilation and Runtime-Optimizations for Software Distributed Shared Memory

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Toward Compiler Support for Scalable Parallelism Using Multipartitioning

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Generalized Multipartitioning for Multi-Dimensional Arrays

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
TEST: a tracer for extracting speculative threads

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
The Jrpm system for dynamically parallelizing Java programs

Proceedings of the 30th annual international symposium on Computer architecture
Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations

Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
COTS Clusters vs. the Earth Simulator: An Application Study Using IMPACT-3D

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Parallelization of the NAS Conjugate Gradient Benchmark Using the Global Arrays Shared Memory Programming Model

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 4 - Volume 05
Optimizing Compiler for the CELL Processor

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Transparent runtime parallelization of the R scripting language

Journal of Parallel and Distributed Computing
Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Journal of Signal Processing Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

With current compilers for High Performance Fortran (HPF), substantial restructuring and hand-optimization may be required to obtain acceptable performance from an HPF port of an existing Fortran application. A key goal of the Rice dHPF compiler project is to develop optimization techniques that can provide consistently high performance for a broad spectrum of scientific applications with minimal restructuring of existing Fortran 77 or Fortran 90 applications. This paper presents four new optimization techniques we developed to support efficient parallelization of codes with minimal restructuring. These optimizations include computation partition selection for loop nests that use privatizable arrays, along with partial replication of boundary computations to reduce communication overhead; communication-sensitive loop distribution to eliminate inner-loop communications; interprocedural selection of computation partitions; and data availability analysis to eliminate redundant communications. We studied the effectiveness of the dHPF compiler, which incorporates these optimizations, in parallelizing serial versions of the NAS SP and BT application benchmarks. We present experimental results comparing the performance of hand-written MPI code for the benchmarks against code generated from HPF using the dHPF compiler and the Portland Group's pghpf compiler. Using the compilation techniques described in this paper we achieve performance within 15% of hand-written MPI code on 25 processors for BT and within 33% for SP. Furthermore, these results are obtained with HPF versions of the benchmarks that were created with minimal restructuring of the serial code (modifying only approximately 5% of the code).