Accelerating the Explicitly Restarted Arnoldi Method with GPUs Using an Autotuned Matrix Vector Product

Authors:
Jérôme Dubois;Christophe Calvin;Serge Petiton
Affiliations:
jerome.dubois@cea.fr;christophe.calvin@cea.fr;serge.petiton@lifl.fr
Venue:
SIAM Journal on Scientific Computing
Year:
2011

Citing 10

Sparse matrix test problems
ACM Transactions on Mathematical Software (TOMS)

An updated set of basic linear algebra subprograms (BLAS)
ACM Transactions on Mathematical Software (TOMS)

Implicitly Restarted Arnoldi Methods and Subspace Iteration
SIAM Journal on Matrix Analysis and Applications

Multiple Explicitly Restarted Arnoldi Method for Solving Large Eigenproblems
SIAM Journal on Scientific Computing

Matched Filter Computation on FPGA, Cell and GPU
FCCM '07 Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines

Cell broadband engine architecture and its first implementation: a performance view
IBM Journal of Research and Development

Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

The GPU Computing Era
IEEE Micro

The university of Florida sparse matrix collection
ACM Transactions on Mathematical Software (TOMS)

Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster
Parallel Computing

Cited 1

CUDA acceleration of a matrix-free Rosenbrock-K method applied to the shallow water equations
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems


Abstract

This paper presents a parallelized hybrid single-vector Arnoldi algorithm for computing approximations to eigenpairs of a nonsymmetric matrix. We are interested in using accelerators and multicore units to speed up the Arnoldi process. The main goal is to propose a parallel version of the Arnoldi solver that can efficiently use multiple multicore processors or multiple graphics processing units (GPUs) in a mixed coarse- and fine-grained fashion. In the proposed algorithms, this is achieved by autotuning the matrix-vector product before starting the Arnoldi eigensolver and by reorganizing the data and global communications so that communication time is reduced. The execution time, performance, and scalability are assessed with well-known dense and sparse test matrices on multiple Nehalem processors, GT200 NVIDIA Tesla GPUs, and next-generation Fermi Tesla GPUs. With one processor, we see a speedup of 2x to 3x when using all the physical cores and a further speedup of 2x to 8x when adding a GPU to this multicore unit, and hence a total speedup of 4x to 24x compared to the sequential solver.
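The two ideas named in the abstract, timing candidate matrix-vector kernels once before the solve and then running an explicitly restarted Arnoldi iteration with the winner, can be illustrated with a small CPU-only sketch. This is not the authors' implementation: the function names (`autotune_matvec`, `arnoldi`, `eram`), the two NumPy kernels, the test matrix, and the choice to restart with the real part of the dominant Ritz vector are all assumptions made for illustration, and no GPU or multi-process communication is modeled.

```python
import time
import numpy as np

def autotune_matvec(candidates, n, repeats=5):
    """Time each candidate kernel once, up front, and return the fastest one."""
    x = np.ones(n)
    timings = {}
    for name, kernel in candidates.items():
        kernel(x)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(repeats):
            kernel(x)
        timings[name] = (time.perf_counter() - start) / repeats
    best = min(timings, key=timings.get)
    return best, candidates[best], timings

def arnoldi(matvec, v0, m):
    """Build an m-step Arnoldi basis V and Hessenberg matrix H."""
    n = v0.size
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = matvec(V[:, j])
        for i in range(j + 1):  # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:  # invariant subspace reached, stop early
            return V, H, j + 1
        V[:, j + 1] = w / H[j + 1, j]
    return V, H, m

def eram(matvec, n, m=20, tol=1e-8, max_restarts=50, seed=0):
    """Explicitly restarted Arnoldi for the eigenvalue of largest magnitude."""
    v = np.random.default_rng(seed).standard_normal(n)
    for restart in range(max_restarts):
        V, H, k = arnoldi(matvec, v, m)
        ritz_values, Y = np.linalg.eig(H[:k, :k])
        idx = np.argmax(np.abs(ritz_values))
        theta, y = ritz_values[idx], Y[:, idx]
        # Residual norm from the Arnoldi relation: |h_{k+1,k}| * |last entry of y|
        residual = abs(H[k, k - 1] * y[-1])
        u = V[:, :k] @ y
        if residual <= tol * abs(theta):
            return theta, u, restart
        v = u.real  # explicit restart (assumption: real part of the Ritz vector)
    return theta, u, max_restarts

# Hypothetical nonsymmetric test matrix: upper bidiagonal, dominant eigenvalue 2.
n = 200
diag = np.linspace(0.1, 1.0, n)
diag[-1] = 2.0
A = np.diag(diag) + 0.01 * np.diag(np.ones(n - 1), k=1)
rows, cols = np.nonzero(A)
vals = A[rows, cols]

# Two interchangeable kernels for y = A @ x (real-valued x only).
candidates = {
    "dense": lambda x: A @ x,
    "coo_bincount": lambda x: np.bincount(rows, weights=vals * x[cols], minlength=n),
}

name, matvec, timings = autotune_matvec(candidates, n)
theta, u, restarts = eram(matvec, n)
print("selected kernel:", name)
print("dominant Ritz value:", theta, "after", restarts, "restart(s)")
```

Selecting the kernel once before the solve means the timing cost is paid a single time, while every Arnoldi step reuses the chosen kernel. In the setting the abstract describes, the candidates would presumably be different sparse formats or CPU and GPU kernels rather than the two NumPy variants used here.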