We present a massively parallel implementation of the symmetric sparse matrix-vector product (SpMV) for modern clusters with scalar multi-core CPUs. We study matrices with highly variable structure and density that arise from unstructured three-dimensional FEM discretizations of mechanical and diffusion problems. We introduce a metric of effective memory bandwidth and use it to analyze the performance impact of a set of simple, well-known optimizations: matrix reordering, manual prefetching, and blocking. We show a modification of the CRS (compressed row storage) format that improves performance on multi-core Opterons. We optimize the performance of an entire SMP blade rather than per-core performance. Even for the simplest 4-node mechanical element, our code utilizes close to 100% of the memory bandwidth available per blade. We show that reducing the storage requirements for symmetric matrices yields a speedup of roughly a factor of two. Blocking brings further storage savings and a proportional performance increase. We compare our results to existing state-of-the-art implementations of SpMV and to dense BLAS2 performance. Parallel efficiency on 5400 Opteron cores of the Cray XT4 cluster is around 80-90% for problems with approximately 25^3 mesh nodes per core. For a problem with 820 million degrees of freedom, the code runs with a sustained performance of 5.2 TFLOP/s, over 20% of the theoretical peak.
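To illustrate why symmetric storage helps, here is a minimal sketch of a symmetric SpMV that stores only the upper triangle (including the diagonal) in CRS. It is not the authors' implementation: the function name `sym_spmv_upper_crs` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for illustration, and the sketch omits the reordering, prefetching, blocking, and CRS modification mentioned in the abstract.

```python
def sym_spmv_upper_crs(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a symmetric A whose upper triangle
    (diagonal included) is stored in CRS format.

    Hypothetical helper for illustration only.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        xi = x[i]
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = values[k]
            acc += a * x[j]       # contribution of stored entry A[i][j]
            if j != i:
                y[j] += a * xi    # contribution of the mirrored entry A[j][i]
        y[i] += acc
    return y


# Example matrix (only the upper triangle is stored):
# A = [[4, 1, 0],
#      [1, 3, 2],
#      [0, 2, 5]]
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 1, 2, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = sym_spmv_upper_crs(row_ptr, col_idx, values, x)
print(y)  # [6.0, 13.0, 19.0]
```

Each stored off-diagonal entry is read from memory once but used twice, so the matrix traffic is roughly halved compared with full storage. For a memory-bound kernel this is consistent with the roughly twofold speedup reported above. The price is the scattered update `y[j] += a * xi`, which a parallel implementation has to coordinate between threads or processes.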