GPUs have become extremely promising multi-/many-core architectures for a wide range of demanding applications. The basic features of these architectures include a large number of relatively simple processing units operating in SIMD fashion, together with hardware-supported, advanced multithreading. However, the use of GPUs in everyday practice is still limited, mainly because implemented algorithms must be deeply adapted to the target architecture. In this work, we propose how to perform such an adaptation to achieve an efficient parallel implementation of the conjugate gradient (CG) algorithm, which is widely used for solving large sparse linear systems of equations arising, e.g., in FEM problems. Aiming at an efficient implementation of the main operation of the CG algorithm, sparse matrix-vector multiplication (SpMV), we propose and study different techniques for optimizing access to the hierarchical memory of GPUs. The experimental investigation of the proposed CUDA-based implementation of the CG algorithm is carried out on two GPU architectures: GeForce 8800 and Tesla C1060. It is shown that optimizing access to GPU memory considerably reduces the execution time of the SpMV operation, and consequently yields a significant speedup over CPUs for the whole CG algorithm.