Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines

Authors:
Jaeyoung Choi;Jack J. Dongarra;L. Susan Ostrouchov;Antoine P. Petitet;David W. Walker;R. Clint Whaley
Affiliations:
Department of Computer Science, University of Tennessee at Knoxville, 107 Ayres Hall, Knoxville, TN 37996-1301/ e-mail: {choi,dongarra,sost,petitet,rwhaley}@cs.utk.edu;Dept. of Comp. Sci., Univ. of Tennessee at Knoxville, 107 Ayres Hall, Knoxville, TN 37996-1301 and Math. Sci. Section, Oak Ridge Natnl. Lab., P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367/ e ...;Department of Computer Science, University of Tennessee at Knoxville, 107 Ayres Hall, Knoxville, TN 37996-1301/ e-mail: {choi,dongarra,sost,petitet,rwhaley}@cs.utk.edu;Department of Computer Science, University of Tennessee at Knoxville, 107 Ayres Hall, Knoxville, TN 37996-1301/ e-mail: {choi,dongarra,sost,petitet,rwhaley}@cs.utk.edu;Mathematical Sciences Section, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367/ e-mail: walker@rios2.epm.ornl.gov;Department of Computer Science, University of Tennessee at Knoxville, 107 Ayres Hall, Knoxville, TN 37996-1301/ e-mail: {choi,dongarra,sost,petitet,rwhaley}@cs.utk.edu
Venue:
Scientific Programming
Year:
1996

Abstract

This article discusses the core factorization routines included in the ScaLAPACK library. These routines factor and solve dense systems of linear equations via LU, QR, and Cholesky factorizations. They are implemented using a block-cyclic data distribution and are built on de facto standard kernels for matrix and vector operations (the BLAS and its parallel counterpart, the PBLAS) and for message-passing communication (the BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK routines using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
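The block-cyclic data distribution mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the article: the function names, the 0-based indexing, and the assumption that the first block lives on process 0 are illustrative choices, and the real library exposes its own Fortran tool routines for this mapping. The sketch only shows the general idea that consecutive blocks of rows and columns are dealt out to a process grid in round-robin fashion.

```python
def global_to_local(g, nb, nprocs):
    """Map a 0-based global index g to (owner process, local index)
    under a 1D block-cyclic layout with block size nb.
    Assumes the first block is stored on process 0 (hypothetical choice)."""
    block = g // nb                            # which global block g falls in
    owner = block % nprocs                     # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb    # offset within the owner's storage
    return owner, local


def owner_2d(i, j, mb, nb, prow, pcol):
    """Process-grid coordinates owning matrix entry (i, j) when rows use
    block size mb over prow process rows and columns use block size nb
    over pcol process columns."""
    return global_to_local(i, mb, prow)[0], global_to_local(j, nb, pcol)[0]


if __name__ == "__main__":
    # Show which process owns each entry of a 6 x 6 matrix
    # distributed in 2 x 2 blocks over a 2 x 3 process grid.
    for i in range(6):
        print(" ".join(str(owner_2d(i, j, 2, 2, 2, 3)) for j in range(6)))
```

Because the blocks wrap around the process grid, every process keeps a share of the trailing submatrix as a factorization proceeds, which is the usual motivation given for this layout in dense factorizations.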