Exploiting domain knowledge to optimize parallel computational mechanics codes

Authors:
Chenyang Liu;Muhammad Hasan Jamal;Milind Kulkarni;Arun Prakash;Vijay Pai
Affiliations:
Purdue University, West Lafayette, IN, USA (all authors)
Venue:
Proceedings of the 27th ACM International Conference on Supercomputing
Year:
2013

Abstract

An important emerging problem domain in computational science and engineering is the development of multi-scale computational methods for complex problems in mechanics that span multiple spatial and temporal scales. An attractive approach to solving these problems is recursive decomposition: the problem is broken up into a tree of loosely coupled sub-problems, which can be solved independently and then coupled back together to obtain the desired solution. However, a particular problem can be solved in myriad ways by coupling the sub-problems together in different tree orders. As we argue in this paper, the space of possible orders is vast, the performance gap between an arbitrary order and the best order is potentially quite large, and the likelihood that a domain scientist can find the best order to solve a problem on a particular machine is vanishingly small. We present a system that uses domain-specific knowledge captured in computational libraries to optimize code written in a conventional language (C). The system generates efficient coupling orders to solve computational mechanics problems using recursive decomposition. Our system adopts the inspector-executor paradigm: the problem is inspected, and a novel heuristic finds an effective implementation based on domain properties evaluated by a cost model. The derived implementation is then executed by a parallel run-time system (Cilk), which achieves optimal parallel performance. We demonstrate that our cost model is highly correlated with actual application runtime and that our proposed technique outperforms non-decomposed and non-multiscale methods. The code generated by the heuristic also outperforms alternate scheduling strategies, as well as over 99% of randomly generated recursive decompositions sampled from the space of possible solutions.
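
The inspector-executor idea in the abstract can be illustrated with a small sketch. It is not the paper's heuristic or cost model, which the abstract does not specify. The cubic `coupling_cost` function, the greedy cheapest-pair rule, and every name below are assumptions chosen only to show how an inspector might pick a coupling tree before an executor walks it.

```python
from itertools import combinations


def coupling_cost(a, b):
    """Hypothetical cost model: coupling is cubic in the combined size."""
    return (a["size"] + b["size"]) ** 3


def inspect(leaf_sizes):
    """Inspector: build a coupling tree greedily, cheapest pair first.

    This greedy rule is an illustrative stand-in, not the paper's heuristic.
    """
    nodes = [{"size": s, "children": None, "cost": 0} for s in leaf_sizes]
    while len(nodes) > 1:
        # Pick the pair of current subtrees that is cheapest to couple.
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda p: coupling_cost(nodes[p[0]], nodes[p[1]]),
        )
        a, b = nodes[i], nodes[j]
        merged = {
            "size": a["size"] + b["size"],
            "children": (a, b),
            # Accumulated estimated cost of the whole subtree.
            "cost": a["cost"] + b["cost"] + coupling_cost(a, b),
        }
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


def execute(node):
    """Executor: walk the tree bottom-up and return the coupling steps.

    The two child subtrees are independent, so a task-parallel runtime
    could process them concurrently; here they run sequentially.
    """
    if node["children"] is None:
        return []
    left, right = node["children"]
    return execute(left) + execute(right) + [(left["size"], right["size"])]


tree = inspect([1, 2, 4, 8])
steps = execute(tree)
print(tree["cost"], steps)
```

Under this assumed cost model, the greedy order for leaf sizes 1, 2, 4, 8 has an estimated cost of 3745. The balanced order that couples (1, 2) and (4, 8) first costs 27 + 1728 + 3375 = 5130. So even on a four-leaf toy problem, the choice of tree order changes the estimated cost noticeably.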