The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors, one data-sensitive and one data-insensitive, to select the best mapping for parallel programs. Using a model learnt off-line, they predict the number of threads and the scheduling policy for any given program. With low-cost profiling runs, they predict the mapping for a new, unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). The performance of our technique is stable across programs and architectures. On average, it delivers above 96% of the maximum available performance on both platforms. It achieves, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance at a significantly lower profiling cost.
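
To make the idea of predicting a mapping concrete, the following is a minimal sketch, not the paper's predictors. The abstract does not name the learning model or the program features, so this uses a one-nearest-neighbour lookup as a stand-in, and the two-element feature vectors, the training rows, and the helper names (`Mapping`, `predict_mapping`, `to_openmp_env`) are all hypothetical. Only the `OMP_NUM_THREADS` and `OMP_SCHEDULE` environment variables are standard OpenMP.

```python
from math import dist
from typing import NamedTuple


class Mapping(NamedTuple):
    threads: int
    schedule: str  # an OpenMP schedule kind: "static", "dynamic" or "guided"


# Hypothetical off-line training set: each row pairs a program's feature
# vector with the best mapping found for it on the target machine.
# The features and values are invented for illustration only.
TRAINING = [
    ((0.9, 0.1), Mapping(8, "static")),
    ((0.2, 0.8), Mapping(2, "dynamic")),
    ((0.5, 0.5), Mapping(4, "guided")),
]


def predict_mapping(features):
    """Return the mapping of the closest training program.

    One-nearest-neighbour is an assumed stand-in for the learnt model.
    """
    _, mapping = min(TRAINING, key=lambda row: dist(row[0], features))
    return mapping


def to_openmp_env(mapping):
    """Express a mapping through the standard OpenMP environment variables."""
    return {
        "OMP_NUM_THREADS": str(mapping.threads),
        "OMP_SCHEDULE": mapping.schedule,
    }


if __name__ == "__main__":
    # Features of a new, unseen program, e.g. from a low-cost profiling run.
    chosen = predict_mapping((0.85, 0.15))
    print(chosen, to_openmp_env(chosen))
```

The `OMP_SCHEDULE` setting only takes effect for loops declared with `schedule(runtime)`, so a program tuned this way would need its parallel loops written accordingly.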