Multigrid and Gauss-Seidel smoothers revisited: parallelization on chip multiprocessors

Authors:
Dan Wallin;Henrik Löf;Erik Hagersten;Sverker Holmgren
Affiliations:
Uppsala University, Uppsala, SWEDEN;Uppsala University, Uppsala, SWEDEN;Uppsala University, Uppsala, SWEDEN;Uppsala University, Uppsala, SWEDEN
Venue:
Proceedings of the 20th annual international conference on Supercomputing
Year:
2006

Abstract

Efficient solution of partial differential equations requires a match between the algorithm and the target architecture. Many recent chip multiprocessors, CMPs (a.k.a. multi-core), feature low intra-thread communication costs and smaller per-thread caches compared to previous shared memory multi-processor systems. From an algorithmic point of view, this means that data locality issues become more important than communication overheads, a fact that may require a re-evaluation of many existing algorithms.

We have investigated parallel implementations of multigrid methods using a parallel, temporally blocked, naturally ordered smoother. Compared to the standard multigrid solution based on a red-black ordering, we improve the data locality, often by as much as a factor of ten, while our use of a fine-grained locking scheme keeps the parallel efficiency high.

Our algorithm was initially inspired by CMPs, and it was surprising to see that our OpenMP multigrid implementation ran up to 40 percent faster than the standard red-black algorithm on a contemporary 8-way SMP system. Thanks to the temporal blocking introduced, our smoother implementation often allowed us to apply the smoother two times at the same cost as a single application of a red-black smoother. By executing our smoother on a 32-thread UltraSPARC T1 (Niagara) SMT/CMP and a simulated 32-way CMP, we demonstrate that such architectures can tolerate the increased communication costs implied by the tradeoffs made in our implementation.
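
To make the idea of temporal blocking in a naturally ordered smoother concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes a 2D Poisson problem on a square grid with a 5-point stencil and a one-cell boundary layer; the function names (`gs_sweep_row`, `two_sweeps_plain`, `two_sweeps_blocked`) and the parameter `h2` (grid spacing squared) are hypothetical. The parallel execution and fine-grained locking described in the abstract are deliberately left out. The sketch only shows that a second Gauss-Seidel sweep can trail the first by one row, so each row is reused while it is still likely to be in cache.

```python
def gs_sweep_row(u, f, h2, i):
    """Update interior row i in place, left to right (natural ordering)."""
    n = len(u) - 2  # number of interior points per dimension
    for j in range(1, n + 1):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                          + u[i][j - 1] + u[i][j + 1]
                          + h2 * f[i][j])


def two_sweeps_plain(u, f, h2):
    """Two full naturally ordered sweeps, one after the other."""
    n = len(u) - 2
    for _ in range(2):
        for i in range(1, n + 1):
            gs_sweep_row(u, f, h2, i)


def two_sweeps_blocked(u, f, h2):
    """Same two sweeps, fused: sweep 2 follows sweep 1 one row behind.

    Sweep 2 on row i-1 needs row i from sweep 1 (just computed) and
    row i-2 from sweep 2 (computed in the previous loop iteration),
    so the dependencies of plain Gauss-Seidel are preserved.
    """
    n = len(u) - 2
    for i in range(1, n + 1):
        gs_sweep_row(u, f, h2, i)          # sweep 1 on row i
        if i >= 2:
            gs_sweep_row(u, f, h2, i - 1)  # sweep 2 on row i-1
    gs_sweep_row(u, f, h2, n)              # sweep 2 finishes on the last row
```

Because both variants perform the same point updates with the same inputs, they should produce identical grids; the blocked variant only changes the order in which rows are visited. A red-black smoother, by contrast, traverses the whole grid once per color, which is the data locality cost the abstract refers to.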