Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs

Authors:
M. Krotkiewski; M. Dabrowski
Affiliations:
Physics of Geological Processes, University of Oslo, Pb 1048 Blindern, 0316 Oslo, Norway (both authors)
Venue:
Parallel Computing
Year:
2010

Abstract

We present a massively parallel implementation of the symmetric sparse matrix-vector product (SpMV) for modern clusters with scalar multi-core CPUs. We study matrices with highly variable structure and density that arise from unstructured three-dimensional FEM discretizations of mechanical and diffusion problems. We introduce a metric of effective memory bandwidth and use it to analyze the performance impact of a set of simple, well-known optimizations: matrix reordering, manual prefetching, and blocking. We show a modification to the CRS storage that improves performance on multi-core Opterons. We optimize the performance of an entire SMP blade rather than the per-core performance. Even for the simplest 4-node mechanical element, our code utilizes close to 100% of the memory bandwidth available per blade. We show that reducing the storage requirements for symmetric matrices yields a roughly twofold speedup. Blocking brings further storage savings and a proportional performance increase. We compare our results to existing state-of-the-art implementations of SpMV and to dense BLAS2 performance. Parallel efficiency on 5400 Opteron cores of the Cray XT4 cluster is around 80-90% for problems with approximately 25^3 mesh nodes per core. For a problem with 820 million degrees of freedom, the code runs with a sustained performance of 5.2 TFLOP/s, over 20% of the theoretical peak.
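
To make the storage argument concrete, the following is a minimal serial sketch of a symmetric SpMV in which only the upper triangle (diagonal included) is kept in CRS form. It is an illustration under stated assumptions, not the authors' implementation: the function name `sym_crs_spmv` and the array names `row_ptr`, `col_idx`, and `val` are hypothetical, and the sketch omits the reordering, prefetching, blocking, and the modified CRS layout mentioned in the abstract.

```python
def sym_crs_spmv(n, row_ptr, col_idx, val, x):
    """Return y = A x for a symmetric n-by-n matrix A.

    Assumption of this sketch: only the upper triangle of A, diagonal
    included, is stored in CRS arrays (row_ptr, col_idx, val).
    """
    y = [0.0] * n
    for i in range(n):
        xi = x[i]
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = val[k]
            acc += a * x[j]      # stored entry a_ij contributes to row i
            if j != i:
                y[j] += a * xi   # mirrored entry a_ji contributes to row j
        y[i] += acc
    return y


# Small example: A = [[4, 1, 0], [1, 3, 2], [0, 2, 5]].
# The full matrix has 7 nonzeros; the upper triangle stores only 5.
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 1, 2, 2]
val = [4.0, 1.0, 3.0, 2.0, 5.0]
y = sym_crs_spmv(3, row_ptr, col_idx, val, [1.0, 2.0, 3.0])
```

Each stored off-diagonal value is read once and used twice, which is where the reduction in memory traffic comes from. The scattered update to `y[j]` means that rows cannot be distributed across threads naively without handling write conflicts; this sketch is serial and does not address that.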