Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping

Authors:
Georgios Tournavitis;Zheng Wang;Björn Franke;Michael F.P. O'Boyle
Affiliations:
University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Year:
2009

Citing 34
Cited 36

Interprocedural dependence analysis and parallelization

SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
A methodology for parallelizing programs for multicomputers and complex memory multiprocessors

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A bridging model for parallel computation

Communications of the ACM
Semantical interprocedural parallelization: an overview of the PIPS project

ICS '91 Proceedings of the 5th international conference on Supercomputing
A training algorithm for optimal margin classifiers

COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
HPFIT: a set of integrated tools for the parallelization of applications using High Performance Fortran. PART I: HPFIT and the TransTOOL environment

Parallel Computing - Special double issue on environment and tools for parallel scientific computing
The implementation of the Cilk-5 multithreaded language

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Threaded multiple path execution

Proceedings of the 25th annual international symposium on Computer architecture
Maximizing parallelism and minimizing synchronization with affine partitions

Parallel Computing - Special issues on languages and compilers for parallel computers
SUIF Explorer: an interactive and interprocedural parallelizer

Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing for reduced code space using genetic algorithms

Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems
The parallel execution of DO loops

Communications of the ACM
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Compile Time Barrier Synchronization Minimization

IEEE Transactions on Parallel and Distributed Systems
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Maximizing Multiprocessor Performance with the SUIF Compiler

Computer
Interactive Parallel Programming using the ParaScope Editor

IEEE Transactions on Parallel and Distributed Systems
Dynamic Dependence Analysis: A Novel Method for Data Depndence Evaluation

Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
A Machine Learning Approach to Automatic Production of Compiler Heuristics

AIMSA '02 Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications
A performance analysis of the Berkeley UPC compiler

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
The Jrpm system for dynamically parallelizing Java programs

Proceedings of the 30th annual international symposium on Computer architecture
Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Evaluating heuristics in automatically mapping multi-loop applications to FPGAs

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Runtime Empirical Selection of Loop Schedulers on Hyperthreaded SMPs

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Performance-Driven Processor Allocation

IEEE Transactions on Parallel and Distributed Systems
Development and Implementation of an Interactive Parallelization Assistance Tool for OpenMP: iPat/OMP

IEICE - Transactions on Information and Systems
X10: concurrent programming for modern architectures

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimistic parallelism requires abstractions

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Software behavior oriented parallelization

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Sensitivity analysis for automatic parallelization on multi-cores

Proceedings of the 21st annual international conference on Supercomputing
Revisiting the Sequential Programming Model for Multi-Core

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Iterative optimization in the polyhedral model: part ii, multidimensional time

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Hierarchically tiled arrays for parallelism and locality

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Partitioning streaming parallelism for multi-cores: a machine learning based approach

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
The Paralax infrastructure: automatic parallelization with a helping hand

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Automatic parallelization of embedded software using hierarchical task graphs and integer linear programming

CODES/ISSS '10 Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Estimating and exploiting potential parallelism by source-level dependence profiling

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
MPSoC programming using the MAPS compiler

Proceedings of the 2010 Asia and South Pacific Design Automation Conference
SD3: A Scalable Approach to Dynamic Data-Dependence Profiling

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
McFLAT: a profile-based framework for MATLAB loop analysis and transformations

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Brainy: effective selection of data structures

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Kremlin: rethinking and rebooting gprof for the multicore age

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
ALTER: exploiting breakable dependences for parallelization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
AARTS: low overhead online adaptive auto-tuning

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Performance analysis and tuning of automatically parallelized OpenMP applications

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Automatically tuning parallel and parallelized programs

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Automatic parallelization of recursive functions using quantifier elimination

FLOPS'10 Proceedings of the 10th international conference on Functional and Logic Programming
Parameterized micro-benchmarking: an auto-tuning approach for complex applications

Proceedings of the 9th conference on Computing Frontiers
A model-driven approach for software parallelization

MODELS'11 Proceedings of the 2011th international conference on Models in Software Engineering
Dynamic trace-based analysis of vectorization potential of applications

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
HELIX: automatic parallelization of irregular programs for chip multiprocessing

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Using graph-based program characterization for predictive modeling

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Fast loop-level data dependence profiling

Proceedings of the 26th ACM international conference on Supercomputing
Multi-slicing: a compiler-supported parallel approach to data dependence profiling

Proceedings of the 2012 International Symposium on Software Testing and Analysis
Automatic extraction of multi-objective aware pipeline parallelism using genetic algorithms

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Portable section-level tuning of compiler parallelized applications

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dynamic thread mapping based on machine learning for transactional memory applications

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Dancing with uncertainty

Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability
Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
General data structure expansion for multi-threading

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Parallelizing Sequential Programs with Statistical Accuracy Tests

ACM Transactions on Embedded Computing Systems (TECS) - Special Section on Probabilistic Embedded Computing
Tightfit: adaptive parallelization with foresight

Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering
Using machine learning to partition streaming programs

ACM Transactions on Architecture and Code Optimization (TACO)
Do inputs matter?: using data-dependence profiling to evaluate thread level speculation in BG/Q

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential

ACM Transactions on Architecture and Code Optimization (TACO)
On Expressing Strategies for Directive-Driven Multicore Programing Models

Proceedings of Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms
Compiling Scilab to high performance embedded multicore systems

Microprocessors & Microsystems
Integrating profile-driven parallelism detection and machine-learning-based mapping

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Compiler-based auto-parallelization is a much studied area, yet has still not found wide-spread application. This is largely due to the poor exploitation of application parallelism, subsequently resulting in performance levels far below those which a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection we overcome the limitations of static analysis, enabling us to identify more application parallelism and only rely on the user for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures. We have evaluated our parallelization strategy against the NAS and SPEC OMP benchmarks and two different multi-core platforms (dual quad-core Intel Xeon SMP and dual-socket QS20 Cell blade). We demonstrate that our approach not only yields significant improvements when compared with state-of-the-art parallelizing compilers, but comes close to and sometimes exceeds the performance of manually parallelized codes. On average, our methodology achieves 96% of the performance of the hand-tuned OpenMP NAS and SPEC parallel benchmarks on the Intel Xeon platform and gains a significant speedup for the IBM Cell platform, demonstrating the potential of profile-guided and machine-learning based parallelization for complex multi-core platforms.