The increasing gap between processor and main-memory speeds has led to hardware architectures with a growing number of caches to reduce average memory access times. Such deep memory hierarchies make the sequential and parallel efficiency of computer programs strongly dependent on their memory access pattern. In this paper, we consider embedded Runge-Kutta methods for the solution of ordinary differential equations and study their efficient implementation on different parallel platforms. In particular, we focus on ordinary differential equations characterized by the special access pattern that results from the spatial discretization of partial differential equations by the method of lines. We explore how the potential parallelism in the stage vector computation of such equations can be exploited in a pipelining approach, leading to better locality and higher scalability. Experiments show that this approach yields efficiency improvements on several recent sequential and parallel computers.
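A minimal sketch of the pipelining idea described above, assuming a method-of-lines right-hand side with access distance 1 (a 1D heat-equation stencil) and a simple 2-stage embedded pair (Heun's method with embedded Euler). The function and parameter names are illustrative, not the paper's actual implementation; the point is only that once a block of the first stage vector is computed, the neighboring block of the second stage can be evaluated while the data is still cache-resident:

```python
def f(y, i, h2):
    """Right-hand side component i: 1D Laplacian stencil with zero
    Dirichlet boundaries. Access distance is 1 (only i-1, i, i+1)."""
    left = y[i - 1] if i > 0 else 0.0
    right = y[i + 1] if i < len(y) - 1 else 0.0
    return (left - 2.0 * y[i] + right) / h2

def heun_step_pipelined(y, dt, h2, B=4):
    """One step of Heun's method with an embedded Euler error estimate.
    The two stage vectors are computed blockwise in a pipelined sweep:
    after stage 1 is finished on block j, stage 2 is evaluated on block
    j-1, which needs stage-1 values only up to the end of block j."""
    n = len(y)
    k1 = [0.0] * n
    k2 = [0.0] * n
    # Stage-2 argument y + dt*k1, formed on the fly per stencil point
    z = lambda i: y[i] + dt * k1[i]
    fz = lambda i: ((z(i - 1) if i > 0 else 0.0)
                    - 2.0 * z(i)
                    + (z(i + 1) if i < n - 1 else 0.0)) / h2
    nb = (n + B - 1) // B
    for j in range(nb + 1):                  # extra iteration drains the pipeline
        if j < nb:                           # stage 1 on block j
            for i in range(j * B, min((j + 1) * B, n)):
                k1[i] = f(y, i, h2)
        if j > 0:                            # stage 2 on block j-1
            for i in range((j - 1) * B, min(j * B, n)):
                k2[i] = fz(i)
    y_new = [y[i] + dt * 0.5 * (k1[i] + k2[i]) for i in range(n)]
    # Difference to the embedded Euler result y + dt*k1
    err = max(abs(dt * 0.5 * (k2[i] - k1[i])) for i in range(n))
    return y_new, err
```

Because the stencil reaches only one point beyond each block boundary, the pipelined sweep produces exactly the same stage vectors as computing each stage over the full grid in turn; what changes is the traversal order, which keeps each block of k1 in cache while k2 is formed from it.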