A Transformation Framework for Optimizing Task-Parallel Programs
ACM Transactions on Programming Languages and Systems (TOPLAS)
Modern languages for shared-memory parallelism are moving from a bulk-synchronous Single Program Multiple Data (SPMD) execution model to lightweight task-parallel execution models for improved productivity. This shift is intended to encourage programmers to express the ideal parallelism in an application at a fine granularity that is natural for the underlying domain, while delegating to the compiler and runtime system the job of extracting coarser-grained useful parallelism for a given target system. A simple and important example of this separation of concerns between ideal and useful parallelism can be found in chunking of parallel loops, where the programmer expresses ideal parallelism by declaring all iterations of a loop to be parallel and the implementation exploits useful parallelism by executing iterations of the loop in sequential chunks. Though chunking of parallel loops has been used as a standard transformation for several years, it poses some interesting challenges when the parallel loop may directly or indirectly (via procedure calls) perform synchronization operations such as barrier, signal, or wait statements. In such cases, a straightforward transformation that attempts to execute a chunk of loop iterations in sequence on a single thread may violate the semantics of the original parallel program. In this paper, we address the problem of chunking parallel loops that may contain synchronization operations. We present a transformation framework that uses a combination of transformations from past work (e.g., loop strip-mining, interchange, distribution, unswitching) to obtain an equivalent set of parallel loops that chunk together statements from multiple iterations while preserving the semantics of the original parallel program. These transformations result in reduced synchronization and scheduling overheads, thereby improving performance and scalability.
Our experimental results for 11 benchmark programs on an UltraSPARC T2 multicore processor showed a geometric mean speedup of 0.52x for the unchunked case and 9.59x for automatic chunking using the techniques described in this paper. This wide gap underscores the importance of using these techniques in future compiler and runtime systems for programming models with lightweight parallelism.
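As a rough illustration of the chunking transformation the abstract describes, the sketch below strip-mines an ideally parallel loop of n iterations into tasks that each run a chunk of iterations sequentially. Plain java.util.concurrent thread pools stand in for a task-parallel runtime here, and the ChunkingDemo class, the chunkedSquares helper, and all parameter names are hypothetical, not from the paper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkingDemo {
    // Strip-mine an "ideally parallel" loop of n iterations into ~n/chunk
    // tasks; each task runs its chunk of iterations sequentially, reducing
    // task-creation and scheduling overhead. This naive chunking is safe
    // only because the loop body performs no synchronization operations
    // (no barrier, signal, or wait).
    static int[] chunkedSquares(int n, int chunk, int threads) throws Exception {
        int[] a = new int[n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> chunks = new ArrayList<>();
        for (int lo = 0; lo < n; lo += chunk) {
            final int start = lo, end = Math.min(lo + chunk, n);
            chunks.add(pool.submit(() -> {
                for (int i = start; i < end; i++) {
                    a[i] = i * i;  // independent loop body, no cross-iteration sync
                }
            }));
        }
        for (Future<?> f : chunks) f.get();  // join all chunk tasks
        pool.shutdown();
        return a;
    }

    // If the loop body were instead { S1(i); barrier; S2(i); }, running a
    // chunk sequentially would deadlock: the task blocks at the barrier of
    // its first iteration while it alone is responsible for the chunk's
    // remaining iterations. Loop distribution (one of the transformations
    // named in the abstract) restores the semantics, with one barrier per
    // chunk instead of one per iteration:
    //   for (i = start; i < end; i++) S1(i);
    //   barrier;
    //   for (i = start; i < end; i++) S2(i);

    public static void main(String[] args) throws Exception {
        System.out.println(Arrays.toString(chunkedSquares(8, 3, 2)));
    }
}
```

The comment block sketches only the simplest case the paper handles; the full framework combines strip-mining, interchange, distribution, and unswitching to chunk loops whose synchronization may be reached through procedure calls.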