Parallel iteration of high-order Runge-Kutta methods with stepsize control
Journal of Computational and Applied Mathematics
Parallel and sequential methods for ordinary differential equations
Parallel and sequential methods for ordinary differential equations
Optimized extrapolation methods for parallel solution of IVPs on different computer architectures
Applied Mathematics and Computation
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Automatically Tuned Linear Algebra Software
Automatically Tuned Linear Algebra Software
Adaptive Loop Tiling for a Multi-cluster CMP
ICA3PP '08 Proceedings of the 8th international conference on Algorithms and Architectures for Parallel Processing
Parameter optimization for explicit parallel peer two-step methods
Applied Numerical Mathematics
Compilers: Principles, Techniques, & Tools with Gradiance
Compilers: Principles, Techniques, & Tools with Gradiance
A scalable auto-tuning framework for compiler optimization
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Locality optimized shared-memory implementations of iterated runge-kutta methods
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Adaptive parallel tiled code generation and accelerated auto-tuning
International Journal of High Performance Computing Applications
Hi-index | 7.29 |
Finding an efficient implementation variant for the numerical solution of problems from computational science and engineering involves many implementation decisions that are strongly influenced by the specific hardware architecture. The complexity of these architectures makes it difficult to find the best implementation variant by manual tuning. For numerical solution methods from linear algebra, auto-tuning techniques based on a global search engine as they are used for ATLAS or FFTW can be used successfully. These techniques generate different implementation variants at installation time and select one of these implementation variants either at installation time or at runtime, before the computation starts. For some numerical methods, auto-tuning at installation time cannot be applied directly, since the best implementation variant may strongly depend on the specific numerical problem to be solved. An example is solution methods for initial value problems (IVPs) of ordinary differential equations (ODEs), where the coupling structure of the ODE system to be solved has a large influence on the efficient use of the memory hierarchy of the hardware architecture. In this context, it is important to use auto-tuning techniques at runtime, which is possible because of the time-stepping nature of ODE solvers. In this article, we present a sequential self-adaptive ODE solver that selects the best implementation variant from a candidate pool at runtime during the first time steps, i.e., the auto-tuning phase already contributes to the progress of the computation. The implementation variants differ in the loop structure and the data structures used to realize the numerical algorithm, a predictor-corrector (PC) iteration scheme with Runge-Kutta (RK) corrector considered here as an example. For those implementation variants in the candidate pool that use loop tiling to exploit the memory hierarchy of a given hardware platform we investigate the selection of tile sizes. The self-adaptive ODE solver combines empirical search with a model-based approach in order to reduce the search space of possible tile sizes. Runtime experiments demonstrate the efficiency of the self-adaptive solver for different IVPs across a range of problem sizes and on different hardware architectures.