Global communication analysis and optimization

Authors:
Soumen Chakrabarti;Manish Gupta;Jong-Deok Choi
Affiliations:
Computer Science Division, U. C. Berkeley, CA;IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY;IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY
Venue:
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Year:
1996

Abstract

Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that, rather than handling loop-nests and array references one by one, it considers all communication in a procedure, and how those communications interact under different placements, before making a final decision on the placement of any communication. It exploits the flexibility resulting from this advanced analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cause cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performance Fortran. During compilation, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. We present performance results for the IBM SP2 and a network of Sparc workstations (NOW) connected by a Myrinet switch. In many cases, the communication cost is reduced by a factor of two.
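
To make the contrast in the abstract concrete, the following is a toy sketch and not the algorithm from the paper. It assumes a deliberately simplified model in which each loop nest lists the array sections it must fetch, and it compares a per-nest count of messages with a procedure-wide count that drops redundant transfers and combines transfers sharing a communication pattern. The names `Comm`, `per_nest_messages`, `global_messages`, and the `writes` map are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comm:
    """One communication requirement found in a loop nest (hypothetical model)."""
    nest: int        # index of the loop nest that needs the data
    array: str       # array being communicated
    section: tuple   # (low, high) bounds of the array section
    pattern: str     # label for the processor mapping, e.g. "shift+1"


def per_nest_messages(comms):
    """Baseline: every requirement in every loop nest gets its own message."""
    return len(comms)


def global_messages(comms, writes):
    """Procedure-wide view: drop redundant transfers, then combine by pattern.

    `writes` maps a nest index to the set of arrays that nest modifies.
    A transfer is treated as redundant if the same (array, section, pattern)
    was fetched earlier and no nest since that fetch has written the array.
    """
    available = {}   # (array, section, pattern) -> nest where it was fetched
    kept = []
    for c in sorted(comms, key=lambda c: c.nest):
        key = (c.array, c.section, c.pattern)
        first = available.get(key)
        stale = first is None or any(
            c.array in writes.get(n, set()) for n in range(first, c.nest)
        )
        if stale:
            available[key] = c.nest
            kept.append(c)
    # Transfers placed at the same nest with the same pattern share one message.
    return len({(c.nest, c.pattern) for c in kept})


comms = [
    Comm(0, "a", (1, 100), "shift+1"),
    Comm(0, "b", (1, 100), "shift+1"),
    Comm(1, "a", (1, 100), "shift+1"),  # same data as nest 0, not yet overwritten
    Comm(2, "a", (1, 100), "shift+1"),  # must be refetched: nest 1 writes "a"
]
writes = {1: {"a"}}

baseline = per_nest_messages(comms)
optimized = global_messages(comms, writes)
print(baseline, optimized)
```

In this toy input the fetch of `a` in nest 1 is dropped as redundant, and the fetches of `a` and `b` in nest 0 are combined because they share a pattern. The sketch ignores the cache and buffer contention trade-offs that the abstract says the real algorithm also weighs when choosing placements.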