Using machine learning to partition streaming programs

Authors:
Zheng Wang; Michael F. P. O'Boyle
Affiliations:
University of Edinburgh, Edinburgh, UK; University of Edinburgh, Edinburgh, UK
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2013

Abstract

Stream-based parallel languages are a popular way to express parallelism in modern applications. The efficient mapping of streaming parallelism to today's multicore systems is, however, highly dependent on the program and underlying architecture. We address this by developing a portable and automatic compiler-based approach to partitioning streaming programs using machine learning. Our technique predicts the ideal partition structure for a given streaming application using prior knowledge learned offline. Using the predictor, we rapidly search the program space (without executing any code) to generate and select a good partition. We applied this technique to standard StreamIt applications and compared it against existing approaches. On a 4-core platform, our approach achieves 60% of the best performance found by iteratively compiling and executing over 3000 different partitions per program. We obtain, on average, a 1.90× speedup over the already tuned partitioning scheme of the StreamIt compiler. When compared against a state-of-the-art analytical, model-based approach, we achieve, on average, a 1.77× performance improvement. By porting our approach to an 8-core platform, we obtain a 1.8× improvement over the StreamIt default scheme, demonstrating the portability of our approach.
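The predict-then-search idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two-element feature vectors, the training pairs, the candidate names, and the 1-nearest-neighbour predictor are all hypothetical stand-ins chosen for illustration. The sketch only shows the general shape of the approach: a model trained offline predicts the features of an ideal partition, and candidate partitions are then ranked by how close their static features are to that prediction, without running any of them.

```python
import math

# Hypothetical offline training data: each pair maps a program's static
# features to the features of the best partition found for it offline.
# The feature meanings and values are illustrative assumptions only.
TRAINING = [
    ((10.0, 0.2), (4.0, 0.9)),
    ((50.0, 0.8), (8.0, 0.5)),
    ((30.0, 0.5), (6.0, 0.7)),
]


def predict_ideal(program_features):
    """Return the ideal-partition features of the nearest training program.

    A 1-nearest-neighbour lookup standing in for whatever learned
    predictor is actually used.
    """
    _, ideal = min(
        TRAINING, key=lambda pair: math.dist(pair[0], program_features)
    )
    return ideal


def select_partition(program_features, candidates):
    """Pick the candidate whose features are closest to the predicted ideal.

    Candidates are compared by static features only; none is executed.
    """
    ideal = predict_ideal(program_features)
    return min(candidates, key=lambda name: math.dist(candidates[name], ideal))


# Hypothetical candidate partitions of one program, described by the same
# kind of feature vector as the predicted ideal.
candidates = {
    "fuse_all": (1.0, 1.0),
    "split_4": (4.0, 0.85),
    "split_8": (8.0, 0.4),
}

best = select_partition((12.0, 0.25), candidates)
print(best)
```

In this toy setting the selection cost is a handful of distance computations per candidate, which is the property that makes searching a large partition space cheap compared with compiling and timing each partition.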