Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory-bandwidth limited. On modern processors with wide SIMD units and large numbers of cores, we identify and address several bottlenecks that can limit performance even before memory bandwidth does: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load imbalance due to non-uniform matrix structures. We describe an efficient implementation of SpMV on the Intel® Xeon Phi™ coprocessor, codenamed Knights Corner (KNC), that addresses these challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC's achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on a dual Intel® Xeon® Processor E5-2680 system and on the NVIDIA Tesla K20X architecture.
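
For context, the sketch below shows a plain CSR SpMV kernel in C with OpenMP. This is an illustrative baseline only, not the paper's KNC kernel; the function and parameter names (spmv_csr, row_ptr, col_idx) are assumptions made for the example.

    /* Baseline CSR SpMV (y = A*x), parallelized over rows with OpenMP.
     * Illustrative sketch, not the paper's specialized KNC kernel. */
    void spmv_csr(int n_rows,
                  const int    *row_ptr,   /* n_rows + 1 entries */
                  const int    *col_idx,   /* column index per nonzero */
                  const double *vals,      /* value per nonzero */
                  const double *x,
                  double       *y)
    {
        /* A per-row partition balances poorly when nonzero counts vary
         * widely across rows -- bottleneck (c) in the abstract. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n_rows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                /* The indirect load x[col_idx[k]] is the irregular,
                 * hard-to-vectorize access -- bottlenecks (a) and (b). */
                sum += vals[k] * x[col_idx[k]];
            }
            y[i] = sum;
        }
    }

All three bottlenecks are visible in this baseline: the inner loop's trip count changes per row, so fixed-width SIMD lanes go partly unused; the gather through col_idx makes accesses to x irregular; and a row-wise partition inherits whatever skew the matrix structure has. A specialized data structure with careful load balancing, as the abstract describes, targets exactly these three effects.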