Understanding metallic magnetism is of fundamental importance for a wide range of technological applications, from thin-film disk drive read heads to bulk magnets used in motors and power generation. In this submission for the Gordon Bell Prize, we use the power of massively parallel processing (MPP) computers to perform first-principles calculations of large system models of non-equilibrium magnetic states in metallic magnets. The calculations are based on a new constrained local moment (CLM) model that places the recently proposed spin dynamics of Antropov et al. [1] on firm theoretical foundations. The equations of the constrained local spin density approximation (constrained LSDA) are solved using the massively parallel locally self-consistent multiple scattering (LSMS) method [3], extended to treat general non-collinear arrangements of the magnetic moments [4]. A general algorithm has been developed for self-consistently finding the constraining fields that are introduced into the LSDA to maintain a prescribed configuration of magnetic moment orientations. The existence of CLM states is demonstrated for 1024-atom-per-unit-cell models of iron above its Curie temperature. The constrained LSMS method we have developed exploits the locality in the physics of the problem to produce an algorithm that requires only local and limited communication on parallel computers, leading to very good scale-up to large processor counts and linear scaling of the number of operations with the number of atoms in the system. The computationally intensive step, inversion of a dense complex matrix, is largely reduced to matrix-matrix multiplies, which are implemented in BLAS. Throughout the code, attention is paid to minimizing both the total operation count and the total execution time, with primacy given to the latter. Full 64-bit arithmetic is used throughout.
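The remark that dense complex inversion can be "largely reduced to matrix-matrix multiplies" can be illustrated with a small sketch. The abstract does not describe the actual algorithm, so the code below is not the authors' implementation. It assumes, purely for illustration, that only the leading k x k block of the inverse is needed, and it obtains that block by repeatedly eliminating trailing blocks with Schur complement updates. The function name `top_left_block_of_inverse` and the `block` parameter are hypothetical. NumPy's `@` operator on complex128 arrays normally dispatches to the BLAS matrix-multiply routine, so most of the arithmetic lands in matrix-matrix products.

```python
import numpy as np


def top_left_block_of_inverse(m, k, block=64):
    """Return the leading k x k block of inv(m) via block elimination.

    Illustrative sketch only: assumes every trailing diagonal block met
    during elimination is invertible (no pivoting across blocks).
    """
    a = np.array(m, dtype=np.complex128)  # 64-bit complex working copy
    end = a.shape[0]
    while end > k:
        start = max(k, end - block)
        d = a[start:end, start:end]  # trailing diagonal block D
        # Solve D X = C, where C holds the trailing block rows.
        x = np.linalg.solve(d, a[start:end, :start])
        # Schur complement update A <- A - B X: one matrix-matrix multiply.
        a[:start, :start] -= a[:start, start:end] @ x
        end = start
    # Only a small k x k matrix remains to be inverted directly.
    return np.linalg.inv(a[:k, :k])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 200, 16
    # Hypothetical test matrix with a strengthened diagonal for good conditioning.
    m = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    m += 4 * n * np.eye(n)
    ours = top_left_block_of_inverse(m, k, block=32)
    print(np.allclose(ours, np.linalg.inv(m)[:k, :k]))
```

Each pass shrinks the working matrix by up to `block` rows and columns, and the dominant cost of a pass is the update `a[:start, start:end] @ x`. A production code would also need a pivoting or conditioning strategy, which this sketch omits.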
The code shows near-linear scale-up to 1024 processing elements (PEs) and attains a performance of 657 Gflops on a Cray T3E1200 LC1024 at a US Government site. Performance figures of 276 Gflops and 329 Gflops have also been obtained on T3E900 and T3E1200 LC512 machines at the National Energy Research Scientific Computing Center (NERSC) and Cray Research, respectively. All performance figures include necessary I/O.
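As a quick arithmetic check on the scale-up claim, the quoted figures can be converted to sustained Gflops per PE. The sketch below assumes that the "LC512" and "LC1024" designations correspond to 512 and 1024 PEs. The abstract states the 1024 figure explicitly, while the 512 figure is inferred from the machine name. The dictionary keys and function name are illustrative.

```python
# Performance figures quoted in the abstract: (total Gflops, assumed PE count).
RUNS = {
    "T3E900 LC512": (276.0, 512),    # PE count inferred from "LC512"
    "T3E1200 LC512": (329.0, 512),   # PE count inferred from "LC512"
    "T3E1200 LC1024": (657.0, 1024),
}


def gflops_per_pe(total_gflops, pes):
    """Sustained Gflops delivered per processing element."""
    return total_gflops / pes


per_pe = {name: gflops_per_pe(*run) for name, run in RUNS.items()}

# Ratio of per-PE rates when going from 512 to 1024 PEs on the T3E1200;
# a value near 1.0 indicates near-linear scale-up.
scaling_efficiency = per_pe["T3E1200 LC1024"] / per_pe["T3E1200 LC512"]

if __name__ == "__main__":
    for name, rate in per_pe.items():
        print(f"{name}: {rate:.3f} Gflops per PE")
    print(f"512 -> 1024 PE scaling efficiency: {scaling_efficiency:.4f}")
```

Under these assumptions the two T3E1200 runs deliver about 0.64 Gflops per PE each, and the T3E900 run about 0.54, so doubling the PE count roughly doubles the total rate. The two T3E1200 runs were made at different sites, so this comparison is indicative rather than a controlled scaling measurement.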