Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well suited to throughput-oriented architectures such as the GPU and that exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
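
As a rough illustration of the operation the abstract refers to, the sketch below computes y = A x for a matrix stored in compressed sparse row (CSR) form. The abstract does not name a storage format, so the choice of CSR, the function name `csr_spmv`, and the small example matrix are all assumptions made for illustration. This is a sequential reference version, not the GPU kernels the paper describes.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative sketch)."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    # Rows are independent of one another, which is the source of the
    # fine-grained parallelism: a parallel version could assign each row
    # to its own thread or group of threads.
    for row in range(num_rows):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row] : row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)
```

Rows with very different numbers of nonzeros give the inner loop very different trip counts. This is one way the irregularity mentioned in the abstract shows up in execution paths and memory access patterns.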