Allocating Independent Subtasks on Parallel Processors
IEEE Transactions on Software Engineering
Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
IEEE Transactions on Computers
Data dependence and its application to parallel processing
International Journal of Parallel Programming
Synchronization using counting semaphores
ICS '88 Proceedings of the 2nd international conference on Supercomputing
The fuzzy barrier: a mechanism for high speed synchronization of processors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
A compiler-assisted approach to SPMD execution
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Concurrency analysis in the presence of procedures using a data-flow framework
TAV4 Proceedings of the symposium on Testing, analysis, and verification
The NAS parallel benchmarks—summary and preliminary results
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Fortran 90 explained
Communication optimization and code generation for distributed memory machines
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Compiler optimizations for eliminating barrier synchronization
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient detection of determinacy races in Cilk programs
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Synchronization transformations for parallel computing
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Advanced compiler design and implementation
Advanced compiler design and implementation
Location Consistency-A New Memory Model and Cache Consistency Protocol
IEEE Transactions on Computers
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Optimization of Object-Oriented Programs Using Static Class Hierarchy Analysis
ECOOP '95 Proceedings of the 9th European Conference on Object-Oriented Programming
Efficient Dependence Analysis for Java Arrays
Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Soot - a Java bytecode optimization framework
CASCON '99 Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research
Java Concurrency in Practice
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Temperature-Sensitive Loop Parallelization for Chip Multiprocessors
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
May-happen-in-parallel analysis of X10 programs
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Productivity and performance using partitioned global address space languages
Proceedings of the 2007 international workshop on Parallel symbolic computation
Statistically rigorous java performance evaluation
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs
IEEE Transactions on Computers
Phasers: a unified deadlock-free construct for collective and point-to-point synchronization
Proceedings of the 22nd annual international conference on Supercomputing
Efficient, portable implementation of asynchronous multi-place programs
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Synchronization optimizations for efficient execution on multi-cores
Proceedings of the 23rd international conference on Supercomputing
Chunking parallel loops in the presence of synchronization
Proceedings of the 23rd international conference on Supercomputing
Work-first and help-first scheduling policies for async-finish task parallelism
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Polyglot: an extensible compiler framework for Java
CC'03 Proceedings of the 12th international conference on Compiler construction
Scaling Java points-to analysis using SPARK
CC'03 Proceedings of the 12th international conference on Compiler construction
Reducing task creation and termination overhead in explicitly parallel programs
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Intermediate language extensions for parallelism
Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11, & VMIL'11
Purity and side effect analysis for java programs
VMCAI'05 Proceedings of the 6th international conference on Verification, Model Checking, and Abstract Interpretation
Unrolling loops containing task parallelism
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Barrier elimination based on access dependency analysis for OpenMP
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Isolation for nested task parallelism
Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Hi-index | 0.00 |
Task parallelism has increasingly become a trend with programming models such as OpenMP 3.0, Cilk, Java Concurrency, X10, Chapel and Habanero-Java (HJ) to address the requirements of multicore programmers. While task parallelism increases productivity by allowing the programmer to express multiple levels of parallelism, it can also lead to performance degradation due to increased overheads. In this article, we introduce a transformation framework for optimizing task-parallel programs with a focus on task creation and task termination operations. These operations can appear explicitly in constructs such as async, finish in X10 and HJ, task, taskwait in OpenMP 3.0, and spawn, sync in Cilk, or implicitly in composite code statements such as foreach and ateach loops in X10, forall and foreach loops in HJ, and parallel loop in OpenMP. Our framework includes a definition of data dependence in task-parallel programs, a happens-before analysis algorithm, and a range of program transformations for optimizing task parallelism. Broadly, our transformations cover three different but interrelated optimizations: (1) finish-elimination, (2) forall-coarsening, and (3) loop-chunking. Finish-elimination removes redundant task termination operations, forall-coarsening replaces expensive task creation and termination operations with more efficient synchronization operations, and loop-chunking extracts useful parallelism from ideal parallelism. All three optimizations are specified in an iterative transformation framework that applies a sequence of relevant transformations until a fixed point is reached. Further, we discuss the impact of exception semantics on the specified transformations, and extend them to handle task-parallel programs with precise exception semantics. Experimental results were obtained for a collection of task-parallel benchmarks on three multicore platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-core Intel Xeon SMP, and a quad-socket 32-core Power7 SMP. We have observed that the proposed optimizations interact with each other in a synergistic way, and result in an overall geometric average performance improvement between 6.28× and 10.30×, measured across all three platforms for the benchmarks studied.