A Transformation Framework for Optimizing Task-Parallel Programs
ACM Transactions on Programming Languages and Systems (TOPLAS)
There has been a proliferation of task-parallel programming systems to address the requirements of multicore programmers. Current production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .NET Task Parallel Library, and OpenMP 3.0; current research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). For forward scalability and portability, it is desirable for programmers to express all the parallelism intrinsic to their algorithm in their code, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total overhead a program incurs due to excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall, and next statements. Our approach includes the elimination of redundant task creation and termination operations as well as the strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP, and a quad-socket 32-way Power7 SMP. The results show maximum speedups of 66.7x, 11.25x, and 23.1x on the three platforms, respectively, and geometric-mean performance improvements of 4.6x, 2.1x, and 6.4x, respectively, relative to unoptimized parallel code. The benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. However, even for the medium-grained benchmarks studied here, the significant improvements obtained by the transformation framework underscore the importance of the compiler optimizations introduced in this paper.
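To make the finish-to-next strength reduction concrete, the following is a minimal, hypothetical Java sketch; it uses java.util.concurrent.Phaser as a stand-in for HJ's next statement, and NTASKS, STEPS, and work() are placeholder names, not identifiers from the paper. An unoptimized version of this pattern would wrap each outer iteration in its own finish, creating and terminating all tasks on every step; the transformed version below creates the tasks once and replaces the per-step terminations with a lightweight barrier.

import java.util.concurrent.Phaser;

public class FinishToNext {
    static final int NTASKS = 4;   // hypothetical task count
    static final int STEPS = 10;   // hypothetical outer-loop trip count

    // Placeholder for the real loop body.
    static void work(int task, int step) {
        System.out.printf("task %d, step %d%n", task, step);
    }

    public static void main(String[] args) throws InterruptedException {
        // One registered party per task; arriveAndAwaitAdvance() below
        // acts as an all-to-all barrier, analogous to HJ's next.
        Phaser barrier = new Phaser(NTASKS);
        Thread[] workers = new Thread[NTASKS];
        for (int t = 0; t < NTASKS; t++) {
            final int task = t;
            workers[t] = new Thread(() -> {
                for (int step = 0; step < STEPS; step++) {
                    work(task, step);
                    // Lightweight synchronization instead of destroying
                    // and recreating all tasks at every step.
                    barrier.arriveAndAwaitAdvance();
                }
            });
            workers[t].start();
        }
        // A single outer join, analogous to one enclosing finish.
        for (Thread w : workers) {
            w.join();
        }
    }
}

Under these assumptions, the cost of task creation and termination is paid once per worker rather than once per worker per step; each step now costs only a barrier operation, which is the lighter-weight synchronization that the strength reduction described in the abstract targets.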