Techniques for efficient placement of synchronization primitives

Authors:
Alexandru Nicolau;Guangqiang Li;Arun Kejariwal
Affiliations:
University of California at Irvine, Irvine, California, USA;University of California at Irvine, Irvine, California, USA;Yahoo! Inc, Santa Clara, California, USA
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Abstract

Harnessing the hardware parallelism of emerging multi-core systems requires concurrent software. Unfortunately, most existing mainstream software is sequential. Although a given program can be auto-parallelized, the efficacy of auto-parallelization is largely limited to floating-point codes. One way to alleviate this limitation is to parallelize programs that cannot be auto-parallelized by means of explicit synchronization. In this regard, efficient placement of the synchronization primitives (say, post and wait) plays a key role in achieving a high degree of thread-level parallelism (TLP). In this paper, we propose novel compiler techniques for such placement. Specifically, given a control flow graph (CFG), the proposed techniques place a post as early as possible and a wait as late as possible in the CFG, subject to dependences. We demonstrate the efficacy of our techniques on a real machine, using real codes drawn from the industry-standard SPEC CPU benchmarks, the Linux kernel, and other widely used open source codes. Our results show that the proposed techniques yield significantly higher levels of TLP than the state of the art.
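
The placement rule stated in the abstract (post as early as possible, wait as late as possible, subject to dependences) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a single straight-line basic block instead of a full CFG, tracks only explicit read and write sets per statement, and uses hypothetical names (`Stmt`, `post_position`, `wait_position`, `place`) introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """One statement in a basic block (hypothetical model, not from the paper)."""
    name: str
    defs: frozenset = frozenset()  # variables the statement writes
    uses: frozenset = frozenset()  # variables the statement reads


def post_position(block, var):
    """Earliest insertion index for post(var) in a producer block.

    The post must follow every write to var, so it goes immediately
    after the last statement that defines var (index 0 if none does).
    """
    last_def = -1
    for i, stmt in enumerate(block):
        if var in stmt.defs:
            last_def = i
    return last_def + 1


def wait_position(block, var):
    """Latest insertion index for wait(var) in a consumer block.

    The wait must precede every access to var, so it goes immediately
    before the first statement that reads or writes var
    (end of the block if no statement touches var).
    """
    for i, stmt in enumerate(block):
        if var in stmt.uses or var in stmt.defs:
            return i
    return len(block)


def place(block, var, primitive):
    """Return the statement names of block with the primitive inserted."""
    if primitive == "post":
        pos = post_position(block, var)
    elif primitive == "wait":
        pos = wait_position(block, var)
    else:
        raise ValueError("primitive must be 'post' or 'wait'")
    names = [stmt.name for stmt in block]
    names.insert(pos, f"{primitive}({var})")
    return names


# Producer: only s1 writes x, so post(x) can move up to just after s1.
producer = [
    Stmt("s1", defs=frozenset({"x"})),
    Stmt("s2", defs=frozenset({"y"}), uses=frozenset({"x"})),
    Stmt("s3", defs=frozenset({"z"})),
]

# Consumer: t1 does not touch x, so wait(x) can sink below it, just before t2.
consumer = [
    Stmt("t1", defs=frozenset({"t"})),
    Stmt("t2", uses=frozenset({"x"})),
    Stmt("t3", uses=frozenset({"t"})),
]

print(place(producer, "x", "post"))
print(place(consumer, "x", "wait"))
```

In this sketch the consumer thread can run `t1` before blocking, and the producer signals without first running `s2` and `s3`, which is the overlap that the early-post, late-wait rule is meant to expose. A CFG-level version would additionally have to reason about branches, loops, and the other kinds of dependence, none of which is modeled here.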