Efficient AMG on heterogeneous systems
Facing the Multicore-Challenge II
In many numerical simulation codes, the computational backbone is the solution of linear systems of equations. Since these systems typically arise from the discretization of differential equations, the corresponding matrices are very sparse. Multigrid methods, in particular algebraic multigrid (AMG), are a popular way to solve such sparse linear systems because of their numerical scalability. On modern multi-core architectures, however, parallel scalability must also be taken into account. Since memory bandwidth is usually the bottleneck of sparse matrix operations, these solvers cannot always benefit from increasing core counts. To exploit the aggregate memory bandwidth of larger NUMA machines, distributing data evenly is often more important than load balancing. Additionally, with a threading model such as OpenMP, data locality must be ensured manually through explicit placement of memory pages. For non-uniform data there is always a trade-off between these three principles, and the ideal strategy is strongly machine- and application-dependent. In this paper we present benchmarks of an AMG implementation based on a new performance library. The main focus is on comparability with state-of-the-art solver packages in terms of sequential performance as well as parallel scalability on common NUMA machines. To maximize throughput on standard model problems, several thread and memory configurations have been evaluated. We show that even on large-scale multi-core architectures, a simple parallel programming model such as OpenMP can achieve performance competitive with more complex programming models.