Mesh independent loop fusion for unstructured mesh applications

Authors:
Carlo Bertolli;Adam Betts;Paul H.J. Kelly;Gihan R. Mudalige;Mike B. Giles
Affiliations:
Imperial College London, London, United Kingdom;Imperial College London, London, United Kingdom;Imperial College London, London, United Kingdom;University of Oxford, Oxford, United Kingdom;University of Oxford, Oxford, United Kingdom
Venue:
Proceedings of the 9th conference on Computing Frontiers
Year:
2012

Abstract

Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used to accelerate them, but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation. In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited by the size of the control flow graph, often becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis then becomes an enabler of OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be defined at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamics application benchmark, assessing the performance gains due to loop fusion.
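
As a rough illustration of the classical loop fusion mentioned in the abstract, the sketch below fuses two loops that iterate over the same set of mesh cells. It is plain Python, not the OP2 API and not the algorithm from the paper; the function names (`unfused`, `fused`), the field names (`pressure`, `density`, `energy`, `flux`), and the arithmetic are all hypothetical and chosen only to make the example concrete. It covers the simplest case, where the second loop reads only the value written by the same iteration of the first loop.

```python
# Illustrative sketch only: plain Python, not the OP2 API and not the
# paper's algorithm. All names and formulas here are hypothetical.

def unfused(pressure, density):
    """Two separate loops over the same set of cells."""
    n = len(pressure)
    energy = [0.0] * n
    for c in range(n):              # loop 1: writes energy[c]
        energy[c] = pressure[c] / density[c]
    flux = [0.0] * n
    for c in range(n):              # loop 2: reads only energy[c]
        flux[c] = 2.0 * energy[c] + 1.0
    return energy, flux


def fused(pressure, density):
    """One loop doing the work of both; energy[c] is reused while still local."""
    n = len(pressure)
    energy = [0.0] * n
    flux = [0.0] * n
    for c in range(n):
        e = pressure[c] / density[c]
        energy[c] = e
        flux[c] = 2.0 * e + 1.0     # consumes e without re-reading the array
    return energy, flux


if __name__ == "__main__":
    p = [1.0, 4.0, 9.0]
    d = [1.0, 2.0, 3.0]
    # Both versions should produce identical results.
    print(unfused(p, d) == fused(p, d))
```

Fusion is legal in this sketch because iteration `c` of the second loop depends only on iteration `c` of the first. If the second loop instead read `energy` through an indirection map (for example, from an edge to its neighbouring cells), the value it needs could be produced by a different iteration, which is one reason fusion is harder to apply to unstructured mesh codes.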