Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations

Authors:
Edward Rothberg; Anoop Gupta
Affiliations:
Department of Computer Science, Stanford University, Stanford, CA; Department of Computer Science, Stanford University, Stanford, CA
Venue:
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Year:
1990

Abstract

In this paper we study the problem of factoring large sparse systems of equations on high-performance multiprocessor workstations. While these machines are capable of very high peak floating-point computation rates, most existing sparse factorization codes achieve only a small fraction of this potential. A major limiting factor is the cost of memory accesses. We describe a parallel factorization code that uses the supernodal structure of the matrix to substantially reduce the number of memory references. We also propose enhancements that significantly reduce the overall cache miss rate. The result is greatly increased factorization performance. We present experimental results from executions on the Silicon Graphics 4D/380 multiprocessor. Using eight processors, the parallel supernodal code achieves a computation rate of approximately 40 MFLOPS when factoring a range of benchmark matrices, more than twice the rate of previously used parallel nodal approaches.
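The abstract does not define "supernodal structure", so the following is only an illustrative sketch under the usual textbook definition: a supernode is a run of consecutive columns of the factor L whose nonzero structure below the diagonal is identical. The function names, the input format (`col_struct`, a list of sorted row-index lists with the diagonal included), and the small example are assumptions made for illustration and are not taken from the paper's code. The sketch also counts stored row indices as a rough proxy for why treating such columns as a unit can reduce memory references.

```python
def find_supernodes(col_struct):
    """Group consecutive columns of L with matching structure.

    col_struct[j] is the sorted list of row indices of nonzeros in
    column j of L, diagonal included (hypothetical input format).
    Returns a list of (first_col, last_col) pairs.
    """
    supernodes = []
    start = 0
    n = len(col_struct)
    for j in range(1, n + 1):
        # Column j extends the current supernode only if its structure
        # equals the previous column's structure minus that column's diagonal.
        extends = j < n and col_struct[j - 1][1:] == col_struct[j]
        if not extends:
            supernodes.append((start, j - 1))
            start = j
    return supernodes


def index_counts(col_struct):
    """Row indices stored per column versus once per supernode."""
    nodal = sum(len(rows) for rows in col_struct)
    supernodal = sum(len(col_struct[first])
                     for first, _ in find_supernodes(col_struct))
    return nodal, supernodal


# Hypothetical 5x5 factor structure (not a matrix from the paper).
example = [[0, 1, 2, 4], [1, 2, 4], [2, 4], [3, 4], [4]]
print(find_supernodes(example))  # [(0, 2), (3, 4)]
print(index_counts(example))     # (12, 6)
```

In this toy case, columns 0 to 2 and columns 3 to 4 each share one index list, so half as many row indices need to be stored and read. A supernodal factorization can then apply each supernode's columns to later columns as a block; that block update and the cache-oriented enhancements mentioned in the abstract are not shown here.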