This paper presents preliminary efforts to develop compilation and execution environments that achieve performance portability for multilevel parallelization on hierarchical architectures. Using the NAS parallel benchmarks, we first illustrate the lack of portable performance on state-of-the-art scalable parallel systems, despite the use of two portable programming models, MPI and OpenMP. We then present a dynamic compilation and execution framework that provides the desired portability through the use of program slices, which are used to select the optimal program decomposition on each architecture. The framework currently uses a simple incremental algorithm that effectively identifies the single-level or multilevel program decompositions that maximize performance. This algorithm can serve as a rule of thumb for automatic multilevel parallelization. The effectiveness of the approach is demonstrated on the NAS benchmarks running on two architectural platforms.
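The abstract does not spell out the incremental algorithm, so the following is only a minimal sketch of one way such a search could look, not the authors' method. It assumes a decomposition is a pair (MPI processes, OpenMP threads per process), that the processor count is a power of two, and that a hypothetical callback `time_slice` runs a program slice under a given decomposition and returns its measured time. The names `incremental_search`, `time_slice`, and `toy_cost` are illustrative, and the cost model is made up.

```python
from typing import Callable, Dict, Tuple

# A decomposition is (MPI processes, OpenMP threads per process).
Decomposition = Tuple[int, int]


def incremental_search(
    total_cpus: int,
    time_slice: Callable[[int, int], float],
) -> Tuple[Decomposition, Dict[Decomposition, float]]:
    """Hypothetical incremental search over decompositions.

    Starts from the single-level (pure MPI) decomposition and repeatedly
    halves the process count while doubling the threads per process.
    Stops at the first step that does not improve the measured slice time.
    Assumes total_cpus is a power of two.
    """
    procs, threads = total_cpus, 1
    best = (procs, threads)
    timings = {best: time_slice(procs, threads)}
    while procs > 1:
        procs, threads = procs // 2, threads * 2
        candidate = (procs, threads)
        timings[candidate] = time_slice(procs, threads)
        if timings[candidate] >= timings[best]:
            break  # no improvement, keep the previous best
        best = candidate
    return best, timings


def toy_cost(procs: int, threads: int) -> float:
    # Made-up stand-in for timing a program slice: one term grows with
    # the number of processes, the other with the number of threads.
    return 0.5 * procs + 0.75 * threads


best, timings = incremental_search(16, toy_cost)
print(best)  # (4, 4) under this toy cost model
```

Because the search stops at the first non-improving step, it measures only a few decompositions rather than all of them. In this toy run the (1, 16) configuration is never timed.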