In today's multicore era, parallelizing serial code is essential to exploiting the performance potential of the architecture. Parallelization, especially of legacy code, is challenging, however: manual effort must go either into algorithmic modifications or into analyzing computationally intensive code sections for the best possible parallel performance, both of which are difficult and time-consuming. Automatic parallelization uses sophisticated compile-time techniques to identify parallelism in serial programs, reducing the burden on the program developer. Similar sophistication is needed to improve the performance of hand-parallelized programs. A key difficulty is that optimizing compilers are generally unable to estimate the performance of an application, or even of a program section, at compile time, so the task of performance improvement invariably rests with the developer. Automatic tuning uses static analysis and runtime performance metrics to determine the compile-time approach that yields the best application performance. This paper describes an offline tuning approach that uses a source-to-source parallelizing compiler, Cetus, together with a tuning framework to tune parallel application performance. The implementation applies an existing, generic tuning algorithm, called Combined Elimination, to study the effect of serializing parallelizable loops based on measured whole-program execution time, and it yields a combination of parallel loops that is guaranteed to equal or improve the performance of the original program. We evaluated our algorithm on a suite of hand-parallelized C benchmarks from the SPEC OMP2001 and NAS Parallel benchmark suites and provide two sets of results. The first ignores hand-parallelized loops and tunes application performance based only on Cetus-parallelized loops. The second considers the tuning of additional parallelism in hand-parallelized code.
We show that our implementation performs nearly equal to or better than the serial code while tuning only Cetus-parallelized loops, and equal to or better than the hand-parallelized code while tuning additional parallelism.
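The Combined Elimination search over parallelize/serialize decisions described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `measure` callback and the synthetic per-loop costs in the usage note are assumptions standing in for actual whole-program timing runs.

```python
def combined_elimination(loops, measure):
    """Combined Elimination over binary parallelize/serialize decisions.

    loops   : identifiers of loops considered for parallelization
    measure : callback taking the set of loops to run in parallel and
              returning whole-program execution time (lower is better);
              in the real system this would time an actual program run
    """
    parallel = set(loops)                  # start with every loop parallel
    base_time = measure(parallel)
    while True:
        # Measure the effect of serializing each remaining loop in isolation.
        gains = {lp: (base_time - measure(parallel - {lp})) / base_time
                 for lp in parallel}       # gain > 0: serializing this loop helps
        harmful = sorted((lp for lp in parallel if gains[lp] > 0),
                         key=lambda lp: gains[lp], reverse=True)
        if not harmful:
            break                          # no loop's serialization helps anymore
        # Serialize the most harmful loop, then re-check the rest of the
        # harmful set against the new baseline before serializing them too.
        parallel.remove(harmful[0])
        base_time = measure(parallel)
        for lp in harmful[1:]:
            t = measure(parallel - {lp})
            if t < base_time:              # still harmful at the new baseline
                parallel.remove(lp)
                base_time = t
    return parallel, base_time
```

With a toy cost model in which each parallel loop either speeds up or slows down the program by a fixed amount, the search converges to the subset of loops whose parallel execution actually pays off, which is the guarantee stated above: the tuned version never runs slower than the fully serialized or fully parallelized starting point it measured.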