In this study, we identify parallelism in the forward/backward substitutions (FBS) for two cases and, on that basis, propose an efficient preconditioned conjugate gradient algorithm with the modified incomplete Cholesky preconditioner on the GPU (GPUMICPCGA). The proposed GPUMICPCGA has the following distinct characteristics: (1) the vector operations are optimized by grouping several of them into single kernels, (2) a new kernel for the inner product and a new, highly optimized kernel for the sparse matrix-vector multiplication are presented, and (3) an efficient parallel implementation of FBS on the GPU (GPUFBS) is suggested for the two cases. Numerical results show that the proposed kernels outperform the corresponding ones in CUBLAS or CUSPARSE, and that GPUFBS is almost 3 times faster than the FBS implementation based on the CUSPARSE library. Furthermore, GPUMICPCGA performs better than its counterpart implemented with the CUBLAS and CUSPARSE libraries.
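To make the structure of a preconditioned conjugate gradient solve concrete, the following is a minimal CPU sketch in Python, not the GPU implementation described above. It rests on several assumptions that are not taken from the abstract: dense list-of-lists storage, a plain zero-fill incomplete Cholesky factor (`ic0`) standing in for the modified variant, sequential forward/backward substitutions, and a small 2D Laplacian test matrix. All function names are hypothetical. The point is only to show where the inner products, the matrix-vector product, and the two triangular solves occur in each iteration.

```python
import math

def matvec(A, x):
    # Dense matrix-vector product; a GPU code would use a sparse format instead.
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def laplacian_2d(m):
    # 5-point Laplacian on an m-by-m grid (symmetric positive definite test matrix).
    n = m * m
    A = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            i = r * m + c
            A[i][i] = 4.0
            if c > 0:
                A[i][i - 1] = -1.0
            if c < m - 1:
                A[i][i + 1] = -1.0
            if r > 0:
                A[i][i - m] = -1.0
            if r < m - 1:
                A[i][i + m] = -1.0
    return A

def ic0(A):
    # Zero-fill incomplete Cholesky (assumption: plain IC(0), not the modified variant).
    # L is computed only where the lower triangle of A is nonzero.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            if A[i][j] == 0.0:
                continue
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    # Solve L y = b, row by row from the top.
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def backward_sub(L, y):
    # Solve L^T x = y, row by row from the bottom.
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def pcg(A, b, L, tol=1e-10, max_iter=200):
    # Preconditioned CG with M = L L^T applied through two triangular solves.
    x = [0.0] * len(b)
    r = list(b)
    z = backward_sub(L, forward_sub(L, r))
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = backward_sub(L, forward_sub(L, r))
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

A = laplacian_2d(3)
x_true = [float(i + 1) for i in range(9)]
b = matvec(A, x_true)
x, iterations = pcg(A, b, ic0(A))
```

In this sketch the updates of `x`, `r`, and `p` are separate passes over the vectors; grouping such updates into a single pass is the kind of fusion the abstract describes for GPU kernels. The loops in `forward_sub` and `backward_sub` are sequential here, which is the dependency that a parallel FBS implementation has to work around.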