This paper presents a parallelized hybrid single-vector Arnoldi algorithm for computing approximations to eigenpairs of a nonsymmetric matrix. We are interested in using accelerators and multicore units to speed up the Arnoldi process. The main goal is to propose a parallel version of the Arnoldi solver that can efficiently use multiple multicore processors or multiple graphics processing units (GPUs) in a mixed coarse- and fine-grained fashion. In the proposed algorithms, this is achieved by autotuning the matrix-vector product before starting the Arnoldi eigensolver, and by reorganizing the data and global communications so that communication time is reduced. Execution time, performance, and scalability are assessed with well-known dense and sparse test matrices on multiple Nehalem processors, GT200 NVIDIA Tesla GPUs, and next-generation Fermi Tesla GPUs. With one processor, we see a speedup of 2 to 3x when using all the physical cores, and a further speedup of 2 to 8x when adding a GPU to this multicore unit, hence a total speedup of 4 to 24x compared to the sequential solver.
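For orientation, the basic single-vector Arnoldi process that such a solver builds on can be sketched as follows. This is a minimal sequential illustration in plain Python, assuming a small dense matrix stored as a list of rows; it is not the paper's hybrid multicore/GPU implementation, and it omits restarting and the extraction of Ritz pairs from the Hessenberg matrix. The names `arnoldi`, `matvec`, `dot`, and the `apply_A` callback are hypothetical. The callback isolates the matrix-vector product, which is the kernel the abstract describes as being autotuned before the eigensolver starts.

```python
import math


def dot(x, y):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows (illustrative kernel)."""
    return [dot(row, x) for row in A]


def arnoldi(apply_A, v0, m, tol=1e-12):
    """Run up to m Arnoldi steps starting from v0.

    apply_A is any callable returning A @ x, so a tuned or accelerated
    kernel could be swapped in without touching this routine.
    Returns (V, H): V is a list of orthonormal basis vectors and H is an
    upper Hessenberg matrix (list of rows) with A V_m = V_{m+1} H.
    On breakdown at step j, H is returned as a square (j+1) x (j+1) matrix.
    """
    norm0 = math.sqrt(dot(v0, v0))
    V = [[x / norm0 for x in v0]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = apply_A(V[j])
        # Modified Gram-Schmidt: orthogonalize w against the current basis.
        for i in range(j + 1):
            H[i][j] = dot(V[i], w)
            w = [wk - H[i][j] * vk for wk, vk in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(dot(w, w))
        if H[j + 1][j] < tol:
            # Breakdown: the Krylov subspace is invariant under A.
            return V, [row[: j + 1] for row in H[: j + 1]]
        V.append([wk / H[j + 1][j] for wk in w])
    return V, H


# Usage on a small nonsymmetric test matrix (made up for illustration).
A = [[2.0, 1.0, 0.0],
     [0.0, 3.0, 1.0],
     [1.0, 0.0, 4.0]]
V, H = arnoldi(lambda x: matvec(A, x), [1.0, 0.0, 0.0], m=2)
```

The eigenvalues of the leading square part of `H` (the Ritz values) approximate eigenvalues of `A`. Computing them needs a small dense eigensolver, which is left out of this sketch.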