Mapping parallelism to multi-cores: a machine learning based approach

Authors:
Zheng Wang;Michael F.P. O'Boyle
Affiliations:
The University of Edinburgh, Edinburgh, United Kingdom;The University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Abstract

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors, a data-sensitive predictor and a data-insensitive predictor, to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. By using low-cost profiling runs, they predict the mapping for a new, unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). The performance of our technique is stable across programs and architectures. On average, it delivers more than 96% of the maximum available performance on both platforms. It achieves, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance with a significantly lower profiling cost.
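
As a rough illustration of what "predicting a mapping" means here, the sketch below picks a thread count and an OpenMP scheduling policy for an unseen program from a small table built off-line. It is not the paper's model: the 1-nearest-neighbour lookup is a stand-in technique, and the feature names, the training rows, and the helpers `predict_mapping` and `to_openmp_env` are all hypothetical. Only `OMP_NUM_THREADS` and `OMP_SCHEDULE` are standard OpenMP environment variables.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Mapping:
    threads: int   # number of OpenMP threads to use
    schedule: str  # OpenMP schedule kind, e.g. "static", "dynamic", "guided"


# Hypothetical off-line training data: one row per training program.
# Each feature vector is (memory-access ratio, branch ratio, normalised
# loop count); these features and values are invented for illustration.
TRAINING = [
    ((0.10, 0.05, 0.90), Mapping(8, "static")),
    ((0.60, 0.10, 0.40), Mapping(4, "dynamic")),
    ((0.85, 0.30, 0.10), Mapping(2, "guided")),
]


def predict_mapping(features):
    """Return the mapping of the closest training program.

    A 1-nearest-neighbour lookup, used here only as a simple stand-in
    for a model learnt off-line.
    """
    _, best = min(TRAINING, key=lambda row: dist(row[0], features))
    return best


def to_openmp_env(mapping):
    """Translate a mapping into standard OpenMP environment variables."""
    return {
        "OMP_NUM_THREADS": str(mapping.threads),
        "OMP_SCHEDULE": mapping.schedule,
    }


if __name__ == "__main__":
    # Features that a cheap profiling run of a new program might yield.
    new_program = (0.12, 0.04, 0.88)
    print(to_openmp_env(predict_mapping(new_program)))
```

Note that `OMP_SCHEDULE` only affects loops declared with `schedule(runtime)`, so a program mapped this way would need its parallel loops written accordingly.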