OP2 is a high-level, domain-specific library framework for the solution of unstructured mesh-based applications. It uses source-to-source translation and compilation so that a single application code written against the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2's recent developments, which facilitate code generation and execution on distributed-memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies when accessing indirectly referenced data, and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are used to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems, including a large-scale Cray XE6 system and a large GPU cluster. A range of performance metrics is benchmarked, including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and system energy consumption. We demonstrate that an application written once at a high level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
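The data dependency mentioned above can be made concrete with a small sketch. When a loop over edges increments values stored on nodes through an edge-to-node map, two edges that share a node may update the same location at once if run in parallel. One common remedy is to color the edges so that no two edges of the same color share a node, then process one color at a time. The code below is an illustration of that general idea only: the function names, the toy mesh and the greedy coloring are assumptions made for this example, not the OP2 API, and the abstract does not state which scheme OP2 itself uses.

```python
# Illustrative sketch only; none of these names come from the OP2 API.

def color_edges(edge_to_node):
    """Greedy coloring: edges that receive the same color share no node."""
    node_colors = {}  # node -> colors already used by edges touching it
    colors = []
    for n0, n1 in edge_to_node:
        used = node_colors.setdefault(n0, set()) | node_colors.setdefault(n1, set())
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        node_colors[n0].add(c)
        node_colors[n1].add(c)
    return colors


def accumulate_by_color(edge_to_node, edge_weight, num_nodes):
    """Indirect loop: each edge increments data on its two end nodes.

    Edges are visited one color at a time. Within a color no two edges
    touch the same node, so that inner loop could be run in parallel
    without conflicting updates (it is sequential here for simplicity).
    """
    colors = color_edges(edge_to_node)
    node_sum = [0.0] * num_nodes
    num_colors = max(colors) + 1 if colors else 0
    for c in range(num_colors):
        for e, (n0, n1) in enumerate(edge_to_node):
            if colors[e] == c:
                node_sum[n0] += edge_weight[e]
                node_sum[n1] += edge_weight[e]
    return node_sum


# Hypothetical toy mesh: a square with one diagonal.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
print(color_edges(edges))
print(accumulate_by_color(edges, weights, 4))
```

Because the mesh is static, a coloring like this would only need to be computed once and could be reused for every execution of the loop.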