Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors

Authors:
Guy E. Blelloch;Michael A. Heroux;Marco Zagha
Venue:
Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors
Year:
1993

Cited by 22

Flattening and parallelizing irregular, recurrent loop nests
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming

Accounting for memory bank contention and delay in high-bandwidth multiprocessors
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures

Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors
IEEE Transactions on Parallel and Distributed Systems

Functional array fusion
Proceedings of the sixth ACM SIGPLAN international conference on Functional programming

An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum
ACM Transactions on Mathematical Software (TOMS)

Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium

Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
International Journal of High Performance Computing Applications

The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers

Scan primitives for GPU computing
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware

Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Fast scan algorithms on graphics processors
Proceedings of the 22nd annual international conference on Supercomputing

Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Parallel Computing

Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture
ARCS '09 Proceedings of the 22nd International Conference on Architecture of Computing Systems

Pattern-based sparse matrix representation for memory-efficient SMVM kernels
Proceedings of the 23rd international conference on Supercomputing

Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

State-of-the-art in heterogeneous computing
Scientific Programming

Analysis of Parallel Algorithms for Energy Conservation with GPU
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing

Optimization of sparse matrix-vector multiplication by auto selecting storage schemes on GPU
ICCSA'11 Proceedings of the 2011 international conference on Computational science and its applications - Volume Part II

Oracle scheduling: controlling granularity in implicitly parallel languages
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications

Vectorized sparse matrix multiply for compressed row storage format
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I

Energy cost evaluation of parallel algorithms for multiprocessor systems
Cluster Computing

yaSpMV: yet another SpMV framework on GPUs
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Abstract

In this paper we present a new technique for sparse matrix multiplication on vector multiprocessors based on the efficient implementation of a segmented sum operation. We describe how the segmented sum can be implemented on vector multiprocessors such that it both fully vectorizes within each processor and parallelizes across processors. Because of our method's insensitivity to relative row size, it is better suited than the Ellpack/Itpack or the Jagged Diagonal algorithms for matrices that have a varying number of non-zero elements in each row. Furthermore, our approach requires less preprocessing (no more time than a single sparse matrix-vector multiplication) and less auxiliary storage, and it uses a more convenient data representation (an augmented form of the standard compressed sparse row format). We have implemented our algorithm (SEGMV) on the Cray Y-MP C90 and have compared its performance with other methods on a variety of sparse matrices from the Harwell-Boeing collection and industrial application codes. Our performance on the test matrices is up to 3 times faster than the Jagged Diagonal algorithm and up to 5 times faster than the Ellpack/Itpack method. Our preprocessing time is an order of magnitude faster than for the Jagged Diagonal algorithm. Also, using an assembly language implementation of SEGMV on a 16-processor C90, the NAS Conjugate Gradient benchmark runs at 3.5 gigaflops.
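
To illustrate the idea of the abstract above, here is a minimal sequential sketch of sparse matrix-vector multiplication expressed through a segmented sum over a compressed sparse row (CSR) matrix. It is not the authors' SEGMV code and does not reproduce their augmented CSR format or their vectorization strategy; the function names (`segment_flags`, `segmented_scan`, `spmv_segmented`) and the flag-based representation are assumptions chosen for illustration only.

```python
def segment_flags(row_ptr, nnz):
    """Mark the first stored element of every non-empty row (hypothetical helper)."""
    flags = [False] * nnz
    for start, end in zip(row_ptr, row_ptr[1:]):
        if start < end:
            flags[start] = True
    return flags


def segmented_scan(values, flags):
    """Inclusive running sum that restarts wherever flags[i] is True."""
    out = []
    running = 0.0
    for v, f in zip(values, flags):
        running = v if f else running + v
        out.append(running)
    return out


def spmv_segmented(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR matrix A using a segmented sum.

    Step 1: elementwise products over all stored non-zeros (one flat loop).
    Step 2: segmented scan over the products, one segment per row.
    Step 3: the last scanned value of each row segment is that row's result.
    """
    products = [a * x[j] for a, j in zip(vals, col_idx)]
    flags = segment_flags(row_ptr, len(vals))
    scanned = segmented_scan(products, flags)
    return [scanned[end - 1] if end > start else 0.0
            for start, end in zip(row_ptr, row_ptr[1:])]


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 0],
    #      [0, 3, 0],
    #      [4, 5, 6]]
    row_ptr = [0, 2, 2, 3, 6]
    col_idx = [0, 2, 1, 0, 1, 2]
    vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    x = [1.0, 2.0, 3.0]
    print(spmv_segmented(row_ptr, col_idx, vals, x))  # [7.0, 0.0, 6.0, 32.0]
```

The point of this formulation is that steps 1 and 2 run over the flat array of non-zeros rather than row by row, so the amount of work per step does not depend on how unevenly the non-zeros are distributed among rows. In this sketch the scan is a plain loop; a vector or parallel implementation would replace it with a parallel segmented scan.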