Compiler optimizations for eliminating barrier synchronization

Authors:
Chau-Wen Tseng
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
1995

Citing 24
Cited 53

Advanced compiler optimizations for supercomputers

Communications of the ACM - Special issue on parallelism
Compiler algorithms for event variable synchronization

ICS '91 Proceedings of the 5th international conference on Supercomputing
Scanning polyhedra with DO loops

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
A compiler-assisted approach to SPMD execution

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Data-parallel programming on MIMD computers

Data-parallel programming on MIMD computers
Compiling Fortran D for MIMD distributed-memory machines

Communications of the ACM
Relaxing SIMD control flow constraints using loop transformations

PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Communication optimization and code generation for distributed memory machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Evaluation of release consistent software distributed shared memory on emerging network technology

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
SUIF: an infrastructure for research on parallelizing and optimizing compilers

ACM SIGPLAN Notices
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler techniques for data synchronization in nested parallel loops

ICS '90 Proceedings of the 4th international conference on Supercomputing
The impact of synchronization and granularity on parallel systems

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
An Iteration Partition Approach for Cache or Local Memory Thrashing on Parallel Processing

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Automatic Array Privatization

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Synchronization Issues in Data-Parallel Languages

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Polaris: Improving the Effectiveness of Parallelizing Compilers

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Static Analysis of Barrier Synchronization in Explicitly Parallel Programs

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Automatic Data Layout Using 0-1 Integer Programming

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Automatic synchronisation elimination in synchronous FORALLs

FRONTIERS '95 Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)

Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Detecting coarse-grain parallelism using an interprocedural parallelizing compiler

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Unified compilation techniques for shared and distributed address space machines

ICS '95 Proceedings of the 9th international conference on Supercomputing
Commutativity analysis: a new analysis framework for parallelizing compilers

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Global communication analysis and optimization

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Static analysis to reduce synchronization costs in data-parallel programs

POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Performance debugging shared memory parallel programs using run-time dependence analysis

SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A graph based approach to barrier synchronisation minimisation

ICS '97 Proceedings of the 11th international conference on Supercomputing
Synchronization transformations for parallel computing

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Commutativity analysis: a new analysis technique for parallelizing compilers

ACM Transactions on Programming Languages and Systems (TOPLAS)
Problem and machine sensitive communication optimization

ICS '98 Proceedings of the 12th international conference on Supercomputing
A Compiler Optimization Algorithm for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
An affine partitioning algorithm to maximize parallelism and minimize communication

ICS '99 Proceedings of the 13th international conference on Supercomputing
Effective synchronization removal for Java

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Minimizing Data and Synchronization Costs in One-Way Communication

IEEE Transactions on Parallel and Distributed Systems
A synthesis of memory mechanisms for distributed architectures

ICS '01 Proceedings of the 15th international conference on Supercomputing
A framework for global communication analysis of optimizations

Compiler optimizations for scalable parallel systems
Compile Time Barrier Synchronization Minimization

IEEE Transactions on Parallel and Distributed Systems
Pointer analysis for structured parallel programs

ACM Transactions on Programming Languages and Systems (TOPLAS)
Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs

International Journal of Parallel Programming
Is OpenMP for Grids?

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Evaluating the Performance of Software Distributed Shared Memory as a Target for Parallelizing Compilers

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Towards OpenMP Execution on Software Distributed Shared Memory Systems

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Analysis of Multithreaded Programs

SAS '01 Proceedings of the 8th International Symposium on Static Analysis
Aggressive communication optimizations for clusters of workstations

Cluster computing
Compiler-generated communication for pipelined FPGA applications

Proceedings of the 40th annual Design Automation Conference
Adaptive Parallel System as the High Performance Parallel Architecture

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Compile-time Synchronization Optimizations for Software DSMs

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Exploiting Processor Workload Heterogeneity for Reducing Energy Consumption in Chip Multiprocessors

Proceedings of the conference on Design, automation and test in Europe - Volume 2
Optimizing OpenMP programs on software distributed shared memory systems

International Journal of Parallel Programming - Special issue: OpenMP: Experiences and implementations
Evaluating heuristics in automatically mapping multi-loop applications to FPGAs

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
A linear-time algorithm for optimal barrier placement

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Shared memory multiprocessor support for functional array processing in SAC

Journal of Functional Programming
Interprocedural parallelization analysis in SUIF

ACM Transactions on Programming Languages and Systems (TOPLAS)
Auto-CFD-NOW: A pre-compiler for effectively parallelizing CFD applications on networks of workstations

The Journal of Supercomputing
Barrier matching for programs with textually unaligned barriers

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler optimization techniques for OpenMP programs

Scientific Programming
Software-directed combined cpu/link voltage scaling fornoc-based cmps

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Efficient, portable implementation of asynchronous multi-place programs

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Synchronization optimizations for efficient execution on multi-cores

Proceedings of the 23rd international conference on Supercomputing
Chunking parallel loops in the presence of synchronization

Proceedings of the 23rd international conference on Supercomputing
Compiling for a hybrid programming model using the LMAD representation

LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Reducing task creation and termination overhead in explicitly parallel programs

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Evaluation of a virtual shared memory machine by the compilation of data parallel loops

EURO-PDP'00 Proceedings of the 8th Euromicro conference on Parallel and distributed processing
Landing stencil code on Godson-T

Journal of Computer Science and Technology
Probabilistic analysis of time reduction by eliminating barriers in parallel programmes

International Journal of Communication Networks and Distributed Systems
Improving energy efficiency of multi-threaded applications using heterogeneous CMOS-TFET multicores

Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
Unifying barrier and point-to-point synchronization in OpenMP with phasers

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Array replication to increase parallelism in applications mapped to configurable architectures

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
DVM: towards a datacenter-scale virtual machine

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Enforcing textual alignment of collectives using dynamic checks

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Barrier elimination based on access dependency analysis for OpenMP

ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
A Transformation Framework for Optimizing Task-Parallel Programs

ACM Transactions on Programming Languages and Systems (TOPLAS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations has been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced 29% on average and by several orders of magnitude for certain programs.