We study the numerical behavior of heterogeneous systems, such as a CPU paired with a GPU or the IBM Cell processor, for orthogonalization processes. We focus on how these accelerators' different handling of floating-point arithmetic affects Gram-Schmidt orthogonalization in single and double precision. For dense matrices we observe a loss of at worst 1 digit of accuracy on CUDA-enabled GPUs, together with a 20× speed-up, and a loss of 2 digits on the Cell processor for a 7× speed-up. For sparse matrices, CPU and GPU results are very close and the speed-up is 10×. We conclude that the Cell processor is a good accelerator for double precision thanks to its full IEEE compliance, but is insufficient for single-precision applications. The GPU speed-up is better than the Cell's, and its decent IEEE support delivers results close to the CPU's in both precisions.
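To illustrate the kind of precision-dependent orthogonality loss the abstract describes, here is a minimal NumPy sketch (not the paper's accelerator code) of classical Gram-Schmidt run on the same dense matrix in single and double precision; the matrix size and the loss-of-orthogonality measure ‖I − QᵀQ‖ are illustrative choices, not taken from the paper:

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical Gram-Schmidt: orthonormalize the columns of A in order."""
    m, n = A.shape
    Q = np.zeros_like(A)
    for j in range(n):
        v = A[:, j].copy()
        if j > 0:
            # Subtract projections onto all previously computed columns.
            v -= Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 100))

for dtype in (np.float32, np.float64):
    Q = classical_gram_schmidt(A.astype(dtype))
    # Loss of orthogonality: how far Q^T Q is from the identity.
    err = np.linalg.norm(np.eye(Q.shape[1], dtype=dtype) - Q.T @ Q)
    print(dtype.__name__, err)
```

On a well-conditioned random matrix the double-precision run keeps ‖I − QᵀQ‖ near machine epsilon while the single-precision run loses several more digits, which is the effect the accuracy comparison between CPU, GPU, and Cell hinges on.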