High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Authors:
Mikhail Smelyanskiy;Karthikeyan Vaidyanathan;Jee Choi;Bálint Joó;Jatin Chhugani;Michael A. Clark;Pradeep Dubey
Affiliations:
Parallel Computing Labs, Intel;Parallel Computing Labs, Intel;Georgia Institute of Technology;Thomas Jefferson National Accelerator;Parallel Computing Labs, Intel;Harvard-Smithsonian Center for Astrophysics;Parallel Computing Labs, Intel
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, this performance drops to 50 Gflops. Our performance is 2-3X higher than a well-known implementation from the Chroma software suite when running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce the external memory bandwidth requirements of LQCD, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32^3 x 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver built on the Wilson-Dslash operator achieves 2.95 Tflops.
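To put the cluster-level figures in perspective, the short Python sketch below derives per-node quantities from the numbers quoted in the abstract. It is only back-of-the-envelope arithmetic, not part of the authors' implementation. The function name `scaling_summary` and the constant names are hypothetical, "over 4 Tflops" is treated as a lower bound of 4.0, and the sketch assumes the lattice sites are divided evenly among nodes.

```python
# Figures quoted in the abstract; names are illustrative, not from the paper.
LATTICE_DIMS = (32, 32, 32, 256)  # 32^3 x 256 lattice sites
NODES = 128
CORES = 1536
DSLASH_TFLOPS = 4.0   # "over 4 Tflops", used here as a lower bound
CG_TFLOPS = 2.95      # full Conjugate Gradients solver


def scaling_summary():
    """Derive per-node quantities from the abstract's cluster-level numbers."""
    sites = 1
    for extent in LATTICE_DIMS:
        sites *= extent
    return {
        "total_sites": sites,
        "cores_per_node": CORES // NODES,
        # Assumes an even split of sites across nodes.
        "sites_per_node": sites // NODES,
        "dslash_gflops_per_node": DSLASH_TFLOPS * 1000 / NODES,
        "cg_gflops_per_node": CG_TFLOPS * 1000 / NODES,
    }


if __name__ == "__main__":
    for name, value in scaling_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions the lattice has 8,388,608 sites, or 65,536 per node, each node has 12 cores, and the sustained rates work out to at least 31.25 Gflops per node for Dslash and about 23 Gflops per node for the Conjugate Gradients solver.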