A parallel Gauss-Seidel method using NR data flow ordering
Applied Mathematics and Computation
Memory characteristics of iterative methods
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Cache-Efficient Multigrid Algorithms
International Journal of High Performance Computing Applications
StatCache: a probabilistic approach to efficient and accurate data locality analysis
ISPASS '04 Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software
Distributed gradient-domain processing of planar and spherical images
ACM Transactions on Graphics (TOG)
On the performance of an algebraic multigrid solver on multicore clusters
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
An Efficient Parallel Implementation for Three-Dimensional Incompressible Pipe Flow Based on SIMPLE
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Hi-index | 0.00 |
Efficient solution of partial differential equations require a match between the algorithm and the target architecture. Many recent chip multiprocessors, CMPs (a.k.a. multi-core), feature low intra-thread communication costs and smaller per-thread caches compared to previous shared memory multi-processor systems. From an algorithmic point of view this means that data locality issues become more important than communication overheads. A fact that may require a re-evaluation of many existing algorithms.We have investigated parallel implementations of multi-grid methods using a parallel temporally blocked, naturally ordered smoother. Compared to the standard multigrid solution based on a red-black ordering, we improve the data locality often as much as ten times, while our use of a fine-grained locking scheme keeps the parallel efficiency high.Our algorithm was initially inspired by CMPs and it was surprising to see that our OpenMP multigrid implementation ran up to 40 percent faster than the standard red-black algorithm on a contemporary 8-way SMP system. Thanks to the temporal blocking introduced, our smoother implementation often allowed us to apply the smoother two times at the same cost as a single application of a red-black smoother. By executing our smoother on a 32-thread UltraSPARC T1 (Niagara) SMT/CMP and a simulated 32-way CMP we demonstrate that such architectures can tolerate the increased communication costs implied by the tradeoffs made in our implementation.