We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV), one of the most heavily used kernels in scientific computing, across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD quad-core, AMD dual-core, and Intel quad-core designs and the heterogeneous STI Cell, and it provides one of the first scientific studies of the highly multithreaded Sun Victoria Falls (a Niagara2 SMP). We present several optimization strategies that are especially effective in the multicore environment, and demonstrate significant performance improvements over existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies in the context of demanding memory-bound numerical algorithms.
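For readers unfamiliar with the kernel, the following is a minimal, unoptimized sketch of SpMV. It assumes the common compressed sparse row (CSR) layout, which the abstract does not specify; the function name `spmv_csr` and the array names are illustrative only and are not taken from the paper. Each stored nonzero contributes just one multiply and one add while requiring a value, a column index, and a source-vector element to be loaded, which is one way to see why the kernel tends to be memory-bound.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and numeric values.
    (Hypothetical names; CSR itself is an assumption of this sketch.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Two flops per nonzero, fed by an indirect load of x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Small example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
```

Because the rows are independent, the outer loop is a natural unit for dividing work among cores; this sketch makes no attempt at the multicore optimizations the paper studies.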