An extended set of FORTRAN basic linear algebra subprograms

Authors:
Jack J. Dongarra;Jeremy Du Croz;Sven Hammarling;Richard J. Hanson
Affiliations:
Argonne National Laboratory, Argonne, IL;Numerical Algorithms Group, Ltd., Oxford, UK;Numerical Algorithms Group, Ltd., Oxford, UK;Applied Dynamics International Corporation, Ann Arbor, MI
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
1988

Citing 8
Cited 169

Increasing the performance of mathematical software through high-level modularity

Proc. of the sixth int'l. symposium on Computing methods in applied sciences and engineering, VI
Algorithm 656: an extended set of basic linear algebra subprograms: model implementation and test programs

ACM Transactions on Mathematical Software (TOMS)
Squeezing the most out of an algorithm in CRAY FORTRAN

ACM Transactions on Mathematical Software (TOMS)
Basic Linear Algebra Subprograms for Fortran Usage

ACM Transactions on Mathematical Software (TOMS)
Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage [F1]

ACM Transactions on Mathematical Software (TOMS)
Improving the efficiency of portable software for linear algebra

ACM SIGNUM Newsletter
A proposal for an extended set of Fortran Basic Linear Algebra Subprograms

ACM SIGNUM Newsletter
Issues relating to extension of the Basic Linear Algebra Subprograms

ACM SIGNUM Newsletter

Algorithm 666: Chabis: a mathematical software package for locating and evaluating roots of systems of nonlinear equations

ACM Transactions on Mathematical Software (TOMS)
Engineering and scientific subroutine library for the IBM 3090 vector facility

IBM Systems Journal
Algorithm 675: Fortran subroutines for computing the square root covariance filter and square root information filter in dense or Hessenberg forms

ACM Transactions on Mathematical Software (TOMS)
A block QR factorization algorithm using restricted pivoting

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs

ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Program optimization and parallelization using idioms

POPL '91 Proceedings of the 18th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Sparse extensions to the FORTRAN Basic Linear Algebra Subprograms

ACM Transactions on Mathematical Software (TOMS)
LAPACK: a portable linear algebra library for high-performance computers

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Hierarchical blocking and data flow analysis for numerical linear algebra

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
The impact of memory organization on the performance of matrix multiplication

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Parallel algorithm research at CERFACS

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A new approach for automatic parallelization of blocked linear Algebra computations

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Automatic data mapping for distributed-memory parallel computers

ICS '92 Proceedings of the 6th international conference on Supercomputing
Optimizing for parallelism and data locality

ICS '92 Proceedings of the 6th international conference on Supercomputing
Computing selected eigenvalues of sparse unsymmetric matrices using subspace iteration

ACM Transactions on Mathematical Software (TOMS)
Toward parallel mathematical software for elliptic partial differential equations

ACM Transactions on Mathematical Software (TOMS)
A parallel block implementation of Level-3 BLAS for MIMD vector processors

ACM Transactions on Mathematical Software (TOMS)
Program optimization and parallelization using idioms

ACM Transactions on Programming Languages and Systems (TOPLAS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
Algorithm 741: least-squares solution of a linear, bordered, block-diagonal system of equations

ACM Transactions on Mathematical Software (TOMS)
Computing the MDM^T decomposition

ACM Transactions on Mathematical Software (TOMS)
Efficient vector and parallel manipulation of tensor products

ACM Transactions on Mathematical Software (TOMS)
Algorithm 753: TENPACK: a LAPACK-based library for the computer manipulation of tensor products

ACM Transactions on Mathematical Software (TOMS)
The design of a new frontal code for solving sparse, unsymmetric systems

ACM Transactions on Mathematical Software (TOMS)
Exploiting zeros on the diagonal in the direct solution of indefinite sparse symmetric linear systems

ACM Transactions on Mathematical Software (TOMS)
Parallel reduction of banded matrices to bidiagonal form

Parallel Computing
The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations

Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference
Algorithm 767: a Fortran 77 package for column reduction of polynomial matrices

ACM Transactions on Mathematical Software (TOMS)
Open implementation design guidelines

ICSE '97 Proceedings of the 19th international conference on Software engineering
Use of parallel level 3 BLAS in LU factorization on three vector multiprocessors: the ALLIANT FX/80, the CRAY-2, and the IBM 3090 VF

ICS '90 Proceedings of the 4th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Practical experience in the numerical dangers of heterogeneous computing

ACM Transactions on Mathematical Software (TOMS)
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Efficient Householder QR factorization for superscalar processors

ACM Transactions on Mathematical Software (TOMS)
Level 3 basic linear algebra subprograms for sparse matrices: a user-level interface

ACM Transactions on Mathematical Software (TOMS)
The automatic generation of sparse primitives

ACM Transactions on Mathematical Software (TOMS)
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
Performance comparisons of Cholesky factorization algorithms using level-2 & 3 BLAS on the national advanced systems AS/XL Vector computer

ICS '89 Proceedings of the 3rd international conference on Supercomputing
Portable and efficient factorization algorithms on the IBM 3090/VF

ICS '89 Proceedings of the 3rd international conference on Supercomputing
Vectorizing a robust inner product algorithm

ICS '89 Proceedings of the 3rd international conference on Supercomputing
Direct numerical simulation of turbulence with a PC/linux cluster: fact or fiction?

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Design and Performance Evaluation of a Portable Parallel Library for Space-Time Adaptive Processing

IEEE Transactions on Parallel and Distributed Systems
Algorithm 800: Fortran 77 subroutines for computing the eigenvalues of Hamiltonian matrices. I: the square-reduced method

ACM Transactions on Mathematical Software (TOMS)
OoLALA: an object oriented analysis and design of numerical linear algebra

OOPSLA '00 Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
PSBLAS: a library for parallel linear algebra computation on sparse matrices

ACM Transactions on Mathematical Software (TOMS)
ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
NetSolve: a network server for solving computational science problems

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Automatic translation of Fortran to JVM bytecode

Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande
Language support for Morton-order matrices

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A recursive formulation of Cholesky factorization of a matrix in packed storage

ACM Transactions on Mathematical Software (TOMS)
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Optimization of a parallel ocean general circulation model

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Distributed component architecture for scientific applications

CRPIT '02 Proceedings of the Fortieth International Conference on Tools Pacific: Objects for internet, mobile and embedded applications
An updated set of basic linear algebra subprograms (BLAS)

ACM Transactions on Mathematical Software (TOMS)
Design, implementation and testing of extended and mixed precision BLAS

ACM Transactions on Mathematical Software (TOMS)
Algorithm 818: A reference model implementation of the sparse BLAS in Fortran 95

ACM Transactions on Mathematical Software (TOMS)
Preface to the special issue on the basic linear algebra subprograms (BLAS)

ACM Transactions on Mathematical Software (TOMS)
Component-based derivation of a parallel stiff ODE solver implemented in a cluster of computers

International Journal of Parallel Programming
Linear Algebra Libraries for High-Performance Computers: A Personal Perspective

IEEE Parallel & Distributed Technology: Systems & Technology
The Decompositional Approach to Matrix Computation

Computing in Science and Engineering
Faster Numerical Algorithms Via Exception Handling

IEEE Transactions on Computers
Parallel multiplication of a vector by a Kronecker product of matrices

Parallel numerical linear algebra
A Recursive Formulation of the Inversion of Symmetric Positive Definite Matrices in Packed Storage Data Format

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Scalable Sparse Matrix Techniques for Modeling Crack Growth

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW

SAIG '00 Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation
An Evaluation of Java for Numerical Computing

ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments
A Performance Study on a Single Processing Node of the HITACHI SR8000

NAA '00 Revised Papers from the Second International Conference on Numerical Analysis and Its Applications
A new data-mapping scheme for latency-tolerant distributed sparse triangular solution

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Advanced environments for parallel and distributed applications: a view of current status

Parallel Computing - Special issue: Advanced environments for parallel and distributed computing
Formal derivation of algorithms: The triangular sylvester equation

ACM Transactions on Mathematical Software (TOMS)
NetSolve: A Network-Enabled Solver: Examples and Users

HCW '98 Proceedings of the Seventh Heterogeneous Computing Workshop
Linear algebra operators for GPU implementation of numerical algorithms

ACM SIGGRAPH 2003 Papers
Mathematical software: past, present, and future

Computational science, mathematics and software
Numerical algorithm delivery mechanisms

Computational science, mathematics and software
References

Sourcebook of parallel computing
PMIRKDC: a parallel mono-implicit Runge-Kutta code with defect control for boundary value ODEs

Parallel Computing
Matrix bidiagonalization: implementation and evaluation on the Trident processor

Neural, Parallel & Scientific Computations
Self-adapting software for numerical linear algebra and LAPACK for clusters

Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Vector reduction/transformation operators

ACM Transactions on Mathematical Software (TOMS)
Architecture of an automatically tuned linear algebra library

Parallel Computing
MA57---a code for the solution of sparse symmetric definite and indefinite systems

ACM Transactions on Mathematical Software (TOMS)
High-performance linear algebra algorithms using new generalized data structures for matrices

IBM Journal of Research and Development
The science of deriving dense linear algebra algorithms

ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces

ACM Transactions on Mathematical Software (TOMS)
Parallel out-of-core computation and updating of the QR factorization

ACM Transactions on Mathematical Software (TOMS)
A fully portable high performance minimal storage hybrid format Cholesky algorithm

ACM Transactions on Mathematical Software (TOMS)
Performance Evaluation of Linear Algebra Routines

International Journal of High Performance Computing Applications
Accumulating Householder transformations, revisited

ACM Transactions on Mathematical Software (TOMS)
Improving the performance of reduction to Hessenberg form

ACM Transactions on Mathematical Software (TOMS)
Linear algebra operators for GPU implementation of numerical algorithms

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
An evaluation of Java for numerical computing

Scientific Programming
JLAPACK - compiling LAPACK Fortran to Java

Scientific Programming
Recursive approach in sparse matrix LU factorization

Scientific Programming
Algorithm 867: QUADLOG—a package of routines for generating Gauss-related quadrature for two classes of logarithmic weight functions

ACM Transactions on Mathematical Software (TOMS)
A highly efficient implementation of back propagation algorithm using matrix instruction set architecture

Neural, Parallel & Scientific Computations
Scalable parallelization of FLAME code via the workqueuing model

ACM Transactions on Mathematical Software (TOMS)
High performance BLAS formulation of the multipole-to-local operator in the fast multipole method

Journal of Computational Physics
Parallelization of a method for the solution of the inverse additive singular value problem

MATH'05 Proceedings of the 8th WSEAS International Conference on Applied Mathematics
Parallel global and local convergent algorithms for solving the inverse additive singular value problem

ISTASC'04 Proceedings of the 4th WSEAS International Conference on Systems Theory and Scientific Computation
A highly efficient implementation of a backpropagation learning algorithm using matrix ISA

Journal of Parallel and Distributed Computing
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix

ACM Transactions on Mathematical Software (TOMS)
Benchmarking Domain-Specific Compiler Optimizations for Variational Forms

ACM Transactions on Mathematical Software (TOMS)
The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines

Scientific Programming
Pattern-Driven Automatic Parallelization

Scientific Programming
An Efficient Implementation of the Thomas-Algorithm for Block Penta-diagonal Systems on Vector Computers

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
High Performance Implementation of Binomial Option Pricing

ICCSA '08 Proceedings of the international conference on Computational Science and Its Applications, Part I
Multidimensional Blocking in UPC

Languages and Compilers for Parallel Computing
A high performance tool for the simulation of the dynamic pantograph-catenary interaction

Mathematics and Computers in Simulation
A unified model for multicore architectures

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
The Mailman algorithm: A note on matrix-vector multiplication

Information Processing Letters
Solving dense linear systems on platforms with multiple hardware accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallelization of Sphere-Decoding Methods

High Performance Computing for Computational Science - VECPAR 2008
Programming matrix algorithms-by-blocks for thread-level parallelism

ACM Transactions on Mathematical Software (TOMS)
C++ Bindings to External Software Libraries with Examples from BLAS, LAPACK, UMFPACK, and MUMPS

ACM Transactions on Mathematical Software (TOMS)
Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software

Cluster Computing
A Parallel Numerical Library for UPC

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
ScaLAPACK's MRRR algorithm

ACM Transactions on Mathematical Software (TOMS)
Accelerating the complex Hessenberg QR algorithm with the CSX600 floating-point coprocessor

PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Evaluating multicore algorithms on the unified memory model

Scientific Programming - Software Development for Multi-core Computing Systems
Blue Gene/L performance tools

IBM Journal of Research and Development
Scaling LAPACK panel operations using parallel cache assignment

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Rectangular full packed format for Cholesky's algorithm: factorization, solution, and inversion

ACM Transactions on Mathematical Software (TOMS)
Scaling and pivoting in an out-of-core sparse direct solver

ACM Transactions on Mathematical Software (TOMS)
Paper: Solving almost block diagonal systems on parallel computers

Parallel Computing
The impact of memory organization on the performance of matrix calculations

Parallel Computing
The performance of the BLAS and LAPACK on a shared memory scalar multiprocessor

Parallel Computing
Efficient parallel algorithm for constructing a unit triangular matrix with prescribed singular values

VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Self-adapting software for numerical linear algebra library routines on clusters

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
Toward memory-efficient linear solvers

VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Minimal data copy for dense linear algebra factorization

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Three versions of a minimal storage Cholesky algorithm using new data structures gives high performance speeds as verified on many computers

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
A supernodal out-of-core sparse Gaussian-elimination method

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Performance evaluation of basic linear algebra subroutines on a matrix co-processor

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Using hybrid CPU-GPU platforms to accelerate the computation of the matrix sign function

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
On improving performance and energy profiles of sparse scientific applications

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
The general matrix multiply-add operation on 2D torus

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Effective out-of-core parallel Delaunay mesh refinement using off-the-shelf software

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A Matrix Computation View of FastMap and RobustMap Dimension Reduction Algorithms

SIAM Journal on Matrix Analysis and Applications
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation

Science of Computer Programming
Adaptive Techniques for Improving the Performance of Incomplete Factorization Preconditioning

SIAM Journal on Scientific Computing
Partial factorization of a dense symmetric indefinite matrix

ACM Transactions on Mathematical Software (TOMS)
An introduction to GPU accelerated surgical simulation

ISBMS'06 Proceedings of the Third international conference on Biomedical Simulation
Parallel optimization methods based on direct search

ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part I
Parallelising matrix operations on clusters for an optimal control-based quantum compiler

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Numerical integration of the differential Riccati equation: a high performance computing approach

VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science
A matrix-type for performance–portability

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Rapid development of high-performance linear algebra libraries

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Semi-automatic generation of grid computing interfaces for numerical software libraries

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Parallel algorithms for the determination of Lyapunov characteristics of large nonlinear dynamical systems

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Parallelization of general matrix multiply routines using OpenMP

WOMPAT'04 Proceedings of the 5th international conference on OpenMP Applications and Tools: shared Memory Parallel Programming with OpenMP
Data mining with parallel support vector machines for classification

ADVIS'06 Proceedings of the 4th international conference on Advances in Information Systems
Two-stage least squares and indirect least squares algorithms for simultaneous equations models

Journal of Computational and Applied Mathematics
High performance BLAS formulation of the adaptive Fast Multipole Method

Mathematical and Computer Modelling: An International Journal
The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations

Journal of Parallel and Distributed Computing
Programming many-core architectures - a case study: dense matrix computations on the Intel single-chip cloud computer processor

Concurrency and Computation: Practice & Experience
Generalizing matrix multiplication for efficient computations on modern computers

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Modeling performance through memory-stalls

ACM SIGMETRICS Performance Evaluation Review
Families of Algorithms for Reducing a Matrix to Condensed Form

ACM Transactions on Mathematical Software (TOMS)
The babyblas - an extended project for introducing undergraduates to the concepts of high performance and parallel scientific computing

Journal of Computing Sciences in Colleges
UPCBLAS: a library for parallel matrix computations in Unified Parallel C

Concurrency and Computation: Practice & Experience
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scaling LAPACK panel operations using parallel cache assignment

ACM Transactions on Mathematical Software (TOMS)
A case study in mechanically deriving dense linear algebra code

International Journal of High Performance Computing Applications
A Basic Linear Algebra Compiler

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Abstract

This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-vector operations that should provide for efficient and portable implementations of algorithms for high-performance computers.
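
As an illustration of the kind of matrix-vector operation the abstract refers to, the following is a minimal sketch of a general matrix-vector update y := alpha*op(A)*x + beta*y, where op(A) is A or its transpose. It is not the paper's model implementation: the function name `gemv`, the Python argument list, the restriction to unit vector strides, and the use of flat Python lists in column-major (Fortran-style) order with a leading dimension `lda` are all assumptions made for this example.

```python
def gemv(trans, m, n, alpha, a, lda, x, beta, y):
    """Illustrative sketch of y := alpha*op(A)*x + beta*y (unit strides).

    `a` is a flat list holding an m-by-n matrix in column-major order with
    leading dimension `lda`, so element A(i, j) is stored at a[i + j*lda].
    trans == "N" uses A; trans == "T" uses the transpose of A.
    """
    if trans not in ("N", "T"):
        raise ValueError("trans must be 'N' or 'T'")
    if lda < max(1, m):
        raise ValueError("lda must be at least max(1, m)")

    leny = m if trans == "N" else n

    # Scale y first; when beta is zero, y is overwritten without being read.
    for i in range(leny):
        y[i] = 0.0 if beta == 0 else beta * y[i]

    if alpha == 0:
        return y

    if trans == "N":
        # Column sweep: y += (alpha * x[j]) * A[:, j], contiguous in memory.
        for j in range(n):
            t = alpha * x[j]
            for i in range(m):
                y[i] += t * a[i + j * lda]
    else:
        # Transposed case: y[j] += alpha * dot(A[:, j], x).
        for j in range(n):
            s = 0.0
            for i in range(m):
                s += a[i + j * lda] * x[i]
            y[j] += alpha * s
    return y


# A = [[1, 2], [3, 4], [5, 6]] stored column by column.
a = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
print(gemv("N", 3, 2, 2.0, a, 3, [1.0, 1.0], 1.0, [1.0, 1.0, 1.0]))
# prints [7.0, 15.0, 23.0]
```

Both branches traverse the matrix one column at a time, which matches the column-major storage assumed here; a tuned implementation for a particular machine would typically reorganize these loops while keeping the same calling interface.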