Optimization of geometric multigrid for emerging multi- and manycore processors

Authors:
Samuel Williams;Dhiraj D. Kalamkar;Amik Singh;Anand M. Deshpande;Brian Van Straalen;Mikhail Smelyanskiy;Ann Almgren;Pradeep Dubey;John Shalf;Leonid Oliker
Affiliations:
Lawrence Berkeley National Laboratory;Intel Corporation;University of California Berkeley;Intel Corporation;Lawrence Berkeley National Laboratory;Intel Corporation;Lawrence Berkeley National Laboratory;Intel Corporation;Lawrence Berkeley National Laboratory;Lawrence Berkeley National Laboratory
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Citing 15

Evaluating Associativity in CPU Caches
IEEE Transactions on Computers

New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation

Tiling optimizations for 3D scientific computations
Proceedings of the 2000 ACM/IEEE conference on Supercomputing

Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing

Sparse matrix solvers on the GPU: conjugate gradients and multigrid
ACM SIGGRAPH 2003 Papers

Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
Proceedings of the 2003 ACM/IEEE conference on Supercomputing

Cache-Efficient Multigrid Algorithms
International Journal of High Performance Computing Applications

The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization
COMPSAC '09 Proceedings of the 2009 33rd Annual IEEE International Computer Software and Applications Conference - Volume 01

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
SIAM Review

Autotuning multigrid with PetaBricks
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

The Scalable Heterogeneous Computing (SHOC) benchmark suite
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Cited 3

Vectorized higher order finite difference kernels
PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing

Assessing the performance of OpenMP programs on the intel xeon phi
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

A fourth-order approximate projection method for the incompressible Navier-Stokes equations on locally-refined periodic domains
Applied Numerical Mathematics

Abstract

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based Infiniband clusters, as well as the new Intel® Xeon Phi™ coprocessor (Knights Corner). Our work examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.