HELIX: automatic parallelization of irregular programs for chip multiprocessing

Authors:
Simone Campanoni;Timothy Jones;Glenn Holloway;Vijay Janapa Reddi;Gu-Yeon Wei;David Brooks
Affiliations:
Harvard University, Cambridge;University of Cambridge, Cambridge, UK;Harvard University, Cambridge;The University of Texas at Austin, Austin;Harvard University, Cambridge;Harvard University, Cambridge
Venue:
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Year:
2012

Citing 32
Cited 4

Reevaluating Amdahl's law

Communications of the ACM
Efficient Doacross execution on distributed shared-memory multiprocessors

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Spin-block synchronization algorithm in the shared memory multiprocessor system

ACM SIGOPS Operating Systems Review
On Effective Execution of Nonuniform DOACROSS Loops

IEEE Transactions on Parallel and Distributed Systems
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

ACM Transactions on Computer Systems (TOCS)
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Redundant Synchronization Elimination for DOACROSS Loops

IEEE Transactions on Parallel and Distributed Systems
Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences

IEEE Transactions on Parallel and Distributed Systems
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Modern Compiler Implementation in Java

Modern Compiler Implementation in Java
Shared Memory Consistency Models: A Tutorial

Computer
Perfect Pipelining: A New Loop Parallelization Technique

ESOP '88 Proceedings of the 2nd European Symposium on Programming
Speculative Parallelization of Partially Parallel Loops

LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Exposing speculative thread parallelism in SPEC2000

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic Thread Extraction with Decoupled Software Pipelining

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Statement Re-ordering for DOACROSS Loops

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 02
Speculative Decoupled Software Pipelining

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Revisiting the Sequential Programming Model for the Multicore Era

IEEE Micro
Parallel-stage decoupled software pipelining

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Performance scalability of decoupled software pipelining

ACM Transactions on Architecture and Code Optimization (TACO)
Validity of the single processor approach to achieving large scale computing capabilities

AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
Techniques for efficient placement of synchronization primitives

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Synchronization optimizations for efficient execution on multi-cores

Proceedings of the 23rd international conference on Supercomputing
Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Optimal loop parallelization for maximizing iteration-level parallelism

CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Exploiting Parallelism with Dependence-Aware Scheduling

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
A highly flexible, parallel virtual machine: design and experience of ILDJIT

Software—Practice & Experience
Speculative parallelization using software multi-threaded transactions

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Scalable Speculative Parallelization on Commodity Clusters

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture

The HELIX project: overview and directions

Proceedings of the 49th Annual Design Automation Conference
From sequential programming to flexible parallel execution

Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
Interprocedural strength reduction of critical sections in explicitly-parallel programs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Accelerating sequential programs on commodity multi-core processors

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel® Core i7-980X, HELIX achieves speedups averaging 2.25 x, with a maximum of 4.12x, for thirteen C benchmarks from SPEC CPU2000.