Manual tuning of applications for heterogeneous parallel systems is tedious and complex. Optimizations are often not portable, and the whole process must be repeated when moving to a new system, or sometimes even to a different problem size. Pattern-based programming models provide structure that can assist in the creation of autotuners for such problems. We present a machine-learning-based autotuning framework that partitions the work created by applications following the wavefront pattern across systems comprising multicore CPUs and multiple GPU accelerators. The use of a pattern facilitates training on synthetically generated instances. Exhaustive search-space exploration on real applications indicates that correctly setting the tuning factors yields a maximum speedup of 20x over an optimized sequential baseline, with an average of 7.8x. Our machine-learned heuristics obtain 98% of this speedup, averaged across a range of applications and architectures.