A set of level 3 basic linear algebra subprograms

Authors:
J. J. Dongarra; Jeremy Du Croz; Sven Hammarling; I. S. Duff
Affiliations:
Univ. of Tennessee, Knoxville; Numerical Algorithms Group Ltd., Oxford, UK; Numerical Algorithms Group Ltd., Oxford, UK; Harwell Lab, Oxfordshire, UK
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
1990

Citing 12
Cited 347

The use of BLAS3 in linear algebra on a parallel processor with a hierarchical memory

SIAM Journal on Scientific and Statistical Computing
The WY representation for products of Householder matrices

SIAM Journal on Scientific and Statistical Computing - Papers from the Second Conference on Parallel Processing for Scientific Computing
An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
Algorithm 656: an extended set of basic linear algebra subprograms: model implementation and test programs

ACM Transactions on Mathematical Software (TOMS)
Block reflectors: theory and computation

SIAM Journal on Numerical Analysis
Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs

ACM Transactions on Mathematical Software (TOMS)
Basic Linear Algebra Subprograms for Fortran Usage

ACM Transactions on Mathematical Software (TOMS)
Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage [F1]

ACM Transactions on Mathematical Software (TOMS)
Solving Large Full Sets of Linear Equations in a Paged Virtual Store

ACM Transactions on Mathematical Software (TOMS)
Organizing matrices and matrix operations for paged memory systems

Communications of the ACM
Advanced Architecture Computers

Advanced Architecture Computers
Issues relating to extension of the Basic Linear Algebra Subprograms

ACM SIGNUM Newsletter

Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs

ACM Transactions on Mathematical Software (TOMS)
Exploiting fast matrix multiplication within the level 3 BLAS

ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
LAPACK: a portable linear algebra library for high-performance computers

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Parallel algorithm research at CERFACS

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Computer Architecture in the 1990s

Computer
A new approach for automatic parallelization of blocked linear algebra computations

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Stability of block algorithms with fast level-3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Automatic data mapping for distributed-memory parallel computers

ICS '92 Proceedings of the 6th international conference on Supercomputing
PYRROS: static task scheduling and code generation for message passing multiprocessors

ICS '92 Proceedings of the 6th international conference on Supercomputing
On the parallelization of blocked LU factorization algorithms on distributed memory architectures

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Computing selected eigenvalues of sparse unsymmetric matrices using subspace iteration

ACM Transactions on Mathematical Software (TOMS)
Parallel direct solution of large sparse systems in finite element computations

ICS '93 Proceedings of the 7th international conference on Supercomputing
A proposal of Level 3 interface for band and skyline matrix factorization subroutine

ICS '93 Proceedings of the 7th international conference on Supercomputing
The role of APL and J in high-performance computation

APL '93 Proceedings of the international conference on APL
Toward parallel mathematical software for elliptic partial differential equations

ACM Transactions on Mathematical Software (TOMS)
Introducing a New Cache Design into Vector Computers

IEEE Transactions on Computers
A parallel block implementation of Level-3 BLAS for MIMD vector processors

ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
Algorithm 741: least-squares solution of a linear, bordered, block-diagonal system of equations

ACM Transactions on Mathematical Software (TOMS)
An Arnoldi code for computing selected eigenvalues of sparse, real, unsymmetric matrices

ACM Transactions on Mathematical Software (TOMS)
The design of a new frontal code for solving sparse, unsymmetric systems

ACM Transactions on Mathematical Software (TOMS)
LAPACK-style algorithms and software for solving the generalized Sylvester equation and estimating the separation between regular matrix pairs

ACM Transactions on Mathematical Software (TOMS)
The design of MA48: a code for the direct solution of sparse unsymmetric linear systems of equations

ACM Transactions on Mathematical Software (TOMS)
Exploiting zeros on the diagonal in the direct solution of indefinite sparse symmetric linear systems

ACM Transactions on Mathematical Software (TOMS)
Parallel reduction of banded matrices to bidiagonal form

Parallel Computing
Tuning the performance of I/O-intensive parallel applications

Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference
The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations

Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference
Design and evaluation of dynamic access ordering hardware

ICS '96 Proceedings of the 10th international conference on Supercomputing
Programming language requirements for the next millennium

ACM Computing Surveys (CSUR) - Special issue: position statements on strategic directions in computing research
Use of parallel level 3 BLAS in LU factorization on three vector multiprocessors the ALLIANT FX/80, the CRAY-2, and the IBM 3090 VF

ICS '90 Proceedings of the 4th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
Practical experience in the numerical dangers of heterogeneous computing

ACM Transactions on Mathematical Software (TOMS)
Compiler blockability of dense matrix factorizations

ACM Transactions on Mathematical Software (TOMS)
Level 3 basic linear algebra subprograms for sparse matrices: a user-level interface

ACM Transactions on Mathematical Software (TOMS)
A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems

IEEE Transactions on Parallel and Distributed Systems
The design, implementation, and evaluation of a symmetric banded linear solver for distributed-memory parallel computers

ACM Transactions on Mathematical Software (TOMS)
The automatic generation of sparse primitives

ACM Transactions on Mathematical Software (TOMS)
An object-oriented framework for block preconditioning

ACM Transactions on Mathematical Software (TOMS)
A combined unifrontal/multifrontal method for unsymmetric sparse matrices

ACM Transactions on Mathematical Software (TOMS)
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues

ACM Transactions on Mathematical Software (TOMS)
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

IEEE Transactions on Parallel and Distributed Systems
Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors

ACM Transactions on Mathematical Software (TOMS)
Direct numerical simulation of turbulence with a PC/Linux cluster: fact or fiction?

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
An annotation language for optimizing software libraries

Proceedings of the 2nd conference on Domain-specific languages
A frontal code for the solution of sparse positive-definite symmetric systems arising from finite-element applications

ACM Transactions on Mathematical Software (TOMS)
Design and Performance Evaluation of a Portable Parallel Library for Space-Time Adaptive Processing

IEEE Transactions on Parallel and Distributed Systems
Hardware-only stream prefetching and dynamic access ordering

Proceedings of the 14th international conference on Supercomputing
Task scheduling using a block dependency DAG for block-oriented sparse Cholesky factorization

SAC '00 Proceedings of the 2000 ACM symposium on Applied computing - Volume 2
Algorithm 800: Fortran 77 subroutines for computing the eigenvalues of Hamiltonian matrices. I: the square-reduced method

ACM Transactions on Mathematical Software (TOMS)
A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations

IEEE Transactions on Parallel and Distributed Systems
OoLALA: an object oriented analysis and design of numerical linear algebra

OOPSLA '00 Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
PSBLAS: a library for parallel linear algebra computation on sparse matrices

ACM Transactions on Mathematical Software (TOMS)
ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Implementation of Strassen's algorithm for matrix multiplication

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
NetSolve: a network server for solving computational science problems

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
A framework for sparse matrix code synthesis from high-level specifications

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Dynamic Access Ordering for Streamed Computations

IEEE Transactions on Computers
Automatic translation of Fortran to JVM bytecode

Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande
A recursive formulation of Cholesky factorization of a matrix in packed storage

ACM Transactions on Mathematical Software (TOMS)
FLAME: Formal Linear Algebra Methods Environment

ACM Transactions on Mathematical Software (TOMS)
Tuning Strassen's matrix multiplication for memory efficiency

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Optimization of a parallel ocean general circulation model

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
PLAPACK: parallel linear algebra package design overview

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Skewed Data Partition and Alignment Techniques for Compiling Programs on Distributed Memory Multicomputers

The Journal of Supercomputing
Register tiling in nonrectangular iteration spaces

ACM Transactions on Programming Languages and Systems (TOPLAS)
An updated set of basic linear algebra subprograms (BLAS)

ACM Transactions on Mathematical Software (TOMS)
Design, implementation and testing of extended and mixed precision BLAS

ACM Transactions on Mathematical Software (TOMS)
An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum

ACM Transactions on Mathematical Software (TOMS)
Algorithm 818: A reference model implementation of the sparse BLAS in Fortran 95

ACM Transactions on Mathematical Software (TOMS)
Preface to the special issue on the basic linear algebra subprograms (BLAS)

ACM Transactions on Mathematical Software (TOMS)
Generic programming for high performance scientific applications

JGI '02 Proceedings of the 2002 joint ACM-ISCOPE conference on Java Grande
Implementing Hager's exchange methods for matrix profile reduction

ACM Transactions on Mathematical Software (TOMS)
Recursive blocked algorithms for solving triangular systems—Part I: one-sided and coupled Sylvester-type matrix equations

ACM Transactions on Mathematical Software (TOMS)
The Finite Element Method for Computing the Stationary Distribution of an SRBM in a Hypercube with Applications to Finite Buffer Queueing Networks

Queueing Systems: Theory and Applications
EXTENT: a portable programming environment for designing and implementing high-performance block recursive algorithms

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Parallel algorithms for LQ optimal control of discrete-time periodic linear systems

Journal of Parallel and Distributed Computing
Linear Algebra Libraries for High-Performance Computers: A Personal Perspective

IEEE Parallel & Distributed Technology: Systems & Technology
The Matrix Template Library: Generic Components for High-Performance Scientific Computing

Computing in Science and Engineering
The Decompositional Approach to Matrix Computation

Computing in Science and Engineering
Smarter Memory: Improving Bandwidth for Streamed References

Computer
Faster Numerical Algorithms Via Exception Handling

IEEE Transactions on Computers
On the Granularity and Clustering of Directed Acyclic Task Graphs

IEEE Transactions on Parallel and Distributed Systems
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Generation of Injective and Reversible Modular Mappings

IEEE Transactions on Parallel and Distributed Systems
A Family of High-Performance Matrix Multiplication Algorithms

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Statistical Models for Automatic Performance Tuning

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Cluster Configuration Aided by Simulation

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky

ICCS '02 Proceedings of the International Conference on Computational Science-Part II
LAWRA Workshop: Linear Algebra with Recursive Algorithms: http://lawra.uni-c.dk/lawra/

HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Parallel Out-of-Core Cholesky and QR Factorization with POOCLAPACK

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
A Recursive Formulation of the Inversion of Symmetric Positive Definite Matrices in Packed Storage Data Format

PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW

SAIG '00 Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation
Skewed Data Partition and Alignment Techniques for Compiling Programs on Distributed Memory Multicomputers

ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Using Pentangular Factorizations for the Reduction to Banded Form

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
The Matrix Template Library: A Generic Programming Approach to High Performance Numerical Linear Algebra

ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments
An Evaluation of Java for Numerical Computing

ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments
HPF and Numerical Libraries

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Blocking Techniques in Numerical Software

ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
A Performance Study on a Single Processing Node of the HITACHI SR8000

NAA '00 Revised Papers from the Second International Conference on Numerical Analysis and Its Applications
Recursive Version of LU Decomposition

NAA '00 Revised Papers from the Second International Conference on Numerical Analysis and Its Applications
Task scheduling using a block dependency DAG for block-oriented sparse Cholesky factorization

Parallel Computing
A new data-mapping scheme for latency-tolerant distributed sparse triangular solution

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Advanced environments for parallel and distributed applications: a view of current status

Parallel Computing - Special issue: Advanced environments for parallel and distributed computing
On parallel block algorithms for exact triangularizations

Parallel Computing
Formal derivation of algorithms: The triangular Sylvester equation

ACM Transactions on Mathematical Software (TOMS)
Finite field linear algebra subroutines

Proceedings of the 2002 international symposium on Symbolic and algebraic computation
NetSolve: A Network-Enabled Solver: Examples and Users

HCW '98 Proceedings of the Seventh Heterogeneous Computing Workshop
A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers

HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Commodity Clusters: Performance Comparison Between PC's and Workstations

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
A Flexible Class of Parallel Matrix Multiplication Algorithms

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Caching-Efficient Multithreaded Fast Multiplication of Sparse Matrices

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Linear algebra operators for GPU implementation of numerical algorithms

ACM SIGGRAPH 2003 Papers
Mathematical software: past, present, and future

Computational science, mathematics and software
Numerical algorithm delivery mechanisms

Computational science, mathematics and software
References

Sourcebook of parallel computing
PMIRKDC: a parallel mono-implicit Runge-Kutta code with defect control for boundary value ODEs

Parallel Computing
Parallel frontal solvers for large sparse linear systems

ACM Transactions on Mathematical Software (TOMS)
Matrix bidiagonalization: implementation and evaluation on the Trident processor

Neural, Parallel & Scientific Computations
Self-adapting software for numerical linear algebra and LAPACK for clusters

Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems

IEEE Transactions on Computers
Surface reconstruction based on compactly supported radial basis functions

Geometric modeling
A data locality optimizing algorithm

ACM SIGPLAN Notices - Best of PLDI 1979-1999
A parallel direct solver for large sparse highly unsymmetric linear systems

ACM Transactions on Mathematical Software (TOMS)
MA57 - a code for the solution of sparse symmetric definite and indefinite systems

ACM Transactions on Mathematical Software (TOMS)
A column pre-ordering strategy for the unsymmetric-pattern multifrontal method

ACM Transactions on Mathematical Software (TOMS)
Parallel and fully recursive multifrontal sparse Cholesky

Future Generation Computer Systems - Special issue: Selected numerical algorithms
High-performance linear algebra algorithms using new generalized data structures for matrices

IBM Journal of Research and Development
A numerical evaluation of HSL packages for the direct solution of large sparse, symmetric linear systems of equations

ACM Transactions on Mathematical Software (TOMS)
A column approximate minimum degree ordering algorithm

ACM Transactions on Mathematical Software (TOMS)
A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Connecting client objectives with resource capabilities: an essential component for grid service management infrastructures

Proceedings of the 2nd international conference on Service oriented computing
64-bit floating-point FPGA matrix multiplication

Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Fast SVM Training Algorithm with Decomposition on Very Large Data Sets

IEEE Transactions on Pattern Analysis and Machine Intelligence
Supporting Cluster-Based Network Services on Functionally Symmetric Software Architecture

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Early Evaluation of the Cray X1

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
The science of deriving dense linear algebra algorithms

ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces

ACM Transactions on Mathematical Software (TOMS)
Parallel out-of-core computation and updating of the QR factorization

ACM Transactions on Mathematical Software (TOMS)
Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A fully portable high performance minimal storage hybrid format Cholesky algorithm

ACM Transactions on Mathematical Software (TOMS)
Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions

Journal of VLSI Signal Processing Systems
Software libraries, numerical and statistical

Encyclopedia of Computer Science
High Performance Computing Systems for Autonomous Spaceborne Missions

International Journal of High Performance Computing Applications
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
High Performance Linear Algebra Operations on Reconfigurable Systems

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

International Journal of High Performance Computing Applications
Building the functional performance model of a processor

Proceedings of the 2006 ACM symposium on Applied computing
Accumulating Householder transformations, revisited

ACM Transactions on Mathematical Software (TOMS)
Improving the performance of reduction to Hessenberg form

ACM Transactions on Mathematical Software (TOMS)
Optimizing FIAT with level 3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Algorithm 854: Fortran 77 subroutines for computing the eigenvalues of Hamiltonian matrices II

ACM Transactions on Mathematical Software (TOMS)
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver

Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Fast additions on masked integers

ACM SIGPLAN Notices
An object-oriented framework for the development of scalable parallel multilevel preconditioners

ACM Transactions on Mathematical Software (TOMS)
Analyzing block locality in Morton-order and Morton-hybrid matrices

MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
Deployment of parallel direct sparse linear solvers within a parallel finite element code

PDCN'06 Proceedings of the 24th IASTED international conference on Parallel and distributed computing and networks
Block algorithms for reordering standard and generalized Schur forms

ACM Transactions on Mathematical Software (TOMS)
The design and implementation of the MRRR algorithm

ACM Transactions on Mathematical Software (TOMS)
Interpolating implicit surfaces from scattered surface data using compactly supported radial basis functions

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Linear algebra operators for GPU implementation of numerical algorithms

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Algorithm 865: Fortran 95 subroutines for Cholesky factorization in block hybrid format

ACM Transactions on Mathematical Software (TOMS)
Basis selection in LOBPCG

Journal of Computational Physics
Data Partitioning with a Functional Performance Model of Heterogeneous Processors

International Journal of High Performance Computing Applications
A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations

ACM Transactions on Mathematical Software (TOMS)
An evaluation of Java for numerical computing

Scientific Programming
JLAPACK - compiling LAPACK Fortran to Java

Scientific Programming
Recursive approach in sparse matrix LU factorization

Scientific Programming
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An annotation language for optimizing software libraries

DSL'99 Proceedings of the 2nd conference on Conference on Domain-Specific Languages - Volume 2
BLASTH, a BLAS library for dual SMP computer

ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
Algorithm 867: QUADLOG—a package of routines for generating Gauss-related quadrature for two classes of logarithmic weight functions

ACM Transactions on Mathematical Software (TOMS)
An operation stacking framework for large ensemble computations

Proceedings of the 21st annual international conference on Supercomputing
Certification of the QR factor R and of lattice basis reducedness

Proceedings of the 2007 international symposium on Symbolic and algebraic computation
Data structures for the distributed iterative solution of non-conventional finite element models

Advances in Engineering Software
High Performance Development for High End Computing With Python Language Wrapper (PLW)

International Journal of High Performance Computing Applications
A highly efficient implementation of back propagation algorithm using matrix instruction set architecture

Neural, Parallel & Scientific Computations
Block variants of Hammarling's method for solving Lyapunov equations

ACM Transactions on Mathematical Software (TOMS)
Data distribution for dense factorization on computers with memory heterogeneity

Parallel Computing
Parallel unsymmetric-pattern multifrontal sparse LU with column preordering

ACM Transactions on Mathematical Software (TOMS)
Scalable parallelization of FLAME code via the workqueuing model

ACM Transactions on Mathematical Software (TOMS)
Analyzing block locality in Morton-order and Morton-hybrid matrices

ACM SIGARCH Computer Architecture News
High performance BLAS formulation of the multipole-to-local operator in the fast multipole method

Journal of Computational Physics
High performance dense linear algebra on a spatially distributed processor

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software (TOMS)
Designing polylibraries to speed up linear algebra computations

International Journal of High Performance Computing and Networking
Teraflops Sustained Performance With Real World Applications

International Journal of High Performance Computing Applications
Server-based data push architecture for multi-processor environments

Journal of Computer Science and Technology
A highly efficient implementation of a backpropagation learning algorithm using matrix ISA

Journal of Parallel and Distributed Computing
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix

ACM Transactions on Mathematical Software (TOMS)
High-performance implementation of the level-3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Effective and scalable software compatibility testing

ISSTA '08 Proceedings of the 2008 international symposium on Software testing and analysis
Dense Linear Algebra over Word-Size Prime Fields: the FFLAS and FFPACK Packages

ACM Transactions on Mathematical Software (TOMS)
Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate

ACM Transactions on Mathematical Software (TOMS)
Performance evaluation of supercomputers using HPCC and IMB Benchmarks

Journal of Computer and System Sciences
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines

Scientific Programming
Algorithmic performance studies on graphics processing units

Journal of Parallel and Distributed Computing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Quick Matrix Multiplication on Clusters of Workstations

Informatica
An Efficient Implementation of the Thomas-Algorithm for Block Penta-diagonal Systems on Vector Computers

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
Performance Model for Parallel Mathematical Libraries Based on Historical Knowledgebase

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
A sparse nonsymmetric eigensolver for distributed memory architectures

International Journal of Parallel, Emergent and Distributed Systems
Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves

ACM Transactions on Mathematical Software (TOMS)
A mathematical model of the static pantograph/catenary interaction

International Journal of Computer Mathematics - RECENT ADVANCES IN COMPUTATIONAL AND APPLIED MATHEMATICS IN SCIENCE AND ENGINEERING
Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
An out-of-core sparse Cholesky solver

ACM Transactions on Mathematical Software (TOMS)
Solving dense linear systems on platforms with multiple hardware accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Petascale computing with accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Improving the Performance of a Verified Linear System Solver Using Optimized Libraries and Parallel Computation

High Performance Computing for Computational Science - VECPAR 2008
LAPACK-Based Condition Estimates for the Discrete-Time LQG Design

Numerical Analysis and Its Applications
Programming the Linpack benchmark for the IBM PowerXCell 8i processor

Scientific Programming - High Performance Computing with the Cell Broadband Engine
Programming matrix algorithms-by-blocks for thread-level parallelism

ACM Transactions on Mathematical Software (TOMS)
Towards many-core implementation of LU decomposition using Peano Curves

Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Mapping the LU decomposition on a many-core architecture: challenges and solutions

Proceedings of the 6th ACM conference on Computing frontiers
C++ Bindings to External Software Libraries with Examples from BLAS, LAPACK, UMFPACK, and MUMPS

ACM Transactions on Mathematical Software (TOMS)
Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Evaluation of the SUN UltraSparc T2+ Processor for Computational Science

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
A Parallel Nonnegative Tensor Factorization Algorithm for Mining Global Climate Data

ICCS 2009 Proceedings of the 9th International Conference on Computational Science
Advanced service trading for scientific computing over the grid

The Journal of Supercomputing
Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Out-of-Core Computation of the QR Factorization on Multi-core Processors

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
On the Need for a Consortium of Capability Centers

International Journal of High Performance Computing Applications
ScaLAPACK's MRRR algorithm

ACM Transactions on Mathematical Software (TOMS)
Cache-optimal algorithms for option pricing

ACM Transactions on Mathematical Software (TOMS)
Run-time automatic instantiation of algorithms using C++ templates

International Journal of Computational Science and Engineering
Sparse matrix factorization on massively parallel computers
Automating the generation of composed linear algebra kernels

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Liquid water: obtaining the right answer for the right reasons

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Blue Gene/L compute chip: control, test, and bring-up infrastructure

IBM Journal of Research and Development
Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

IBM Journal of Research and Development
Standardized mixed language programming for Fortran and C

ACM SIGPLAN Fortran Forum
Scaling LAPACK panel operations using parallel cache assignment

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems

ACM Transactions on Mathematical Software (TOMS)
Rectangular full packed format for cholesky's algorithm: factorization, solution, and inversion

ACM Transactions on Mathematical Software (TOMS)
Scaling and pivoting in an out-of-core sparse direct solver

ACM Transactions on Mathematical Software (TOMS)
Polymorphic architectures: from media processing to supercomputing

CompSysTech '09 Proceedings of the International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing
Algorithms for memory hierarchies: advanced lectures

Algorithms for memory hierarchies: advanced lectures
The impact of memory organization on the performance of matrix calculations

Parallel Computing
The performance of the BLAS and LAPACK on a shared memory scalar multiprocessor

Parallel Computing
A fast parallel optimization for training support vector machine

MLDM'03 Proceedings of the 3rd international conference on Machine learning and data mining in pattern recognition
Semantic-based service trading: application to linear algebra

VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Self-adapting software for numerical linear algebra library routines on clusters

ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
Software development in the grid: the DAMIEN tool-set

ICCS'03 Proceedings of the 1st international conference on Computational science: PartI
Toward memory-efficient linear solvers

VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Operation Stacking for Ensemble Computations With Variable Convergence

International Journal of High Performance Computing Applications
Minimal data copy for dense linear algebra factorization

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
The relevance of new data structure approaches for dense linear algebra in the new multi-core/many core environments

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Three versions of a minimal storage Cholesky algorithm using new data structures gives high performance speeds as verified on many computers

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
New data structures for matrices and specialized inner kernels: low overhead for high performance

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
A supernodal out-of-core sparse Gaussian-elimination method

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Performance evaluation of basic linear algebra subroutines on a matrix co-processor

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Porting existing cache-oblivious linear algebra HPC modules to larrabee architecture

Proceedings of the 7th ACM international conference on Computing frontiers
Solving path problems on the GPU

Parallel Computing
Managing the complexity of lookahead for LU factorization with pivoting

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms

ACM Transactions on Mathematical Software (TOMS)
Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems

ACM Transactions on Mathematical Software (TOMS)
Using hybrid CPU-GPU platforms to accelerate the computation of the matrix sign function

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
CFD parallel simulation using Getfem++ and mumps

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Deployment of a hierarchical middleware

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
The general matrix multiply-add operation on 2D torus

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A Global Convergence Proof for Cyclic Jacobi Methods with Block Rotations

SIAM Journal on Matrix Analysis and Applications
Partitioned Triangular Tridiagonalization

ACM Transactions on Mathematical Software (TOMS)
Solving Very Sparse Rational Systems of Equations

ACM Transactions on Mathematical Software (TOMS)
An analytical network performance model for SIMD processor CSX600 interconnects

Journal of Systems Architecture: the EUROMICRO Journal
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation

Science of Computer Programming
Exact solutions to linear systems of equations using output sensitive lifting

ACM Communications in Computer Algebra
Adaptive Techniques for Improving the Performance of Incomplete Factorization Preconditioning

SIAM Journal on Scientific Computing
A Novel Parallel QR Algorithm for Hybrid Distributed Memory HPC Systems

SIAM Journal on Scientific Computing
Multifrontal computations on GPUs and their multi-core hosts

VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Improving CSE software through reproducibility requirements

Proceedings of the 4th International Workshop on Software Engineering for Computational Science and Engineering
A domain-decomposing parallel sparse linear system solver

Journal of Computational and Applied Mathematics
Knowledge-based automatic generation of partitioned matrix expressions

CASC'11 Proceedings of the 13th international conference on Computer algebra in scientific computing
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

ACM Transactions on Mathematical Software (TOMS)
High-performance up-and-downdating via householder-like transformations

ACM Transactions on Mathematical Software (TOMS)
Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization

ACM Transactions on Mathematical Software (TOMS)
Partial factorization of a dense symmetric indefinite matrix

ACM Transactions on Mathematical Software (TOMS)
A note on shifted Hessenberg systems and frequency response computation

ACM Transactions on Mathematical Software (TOMS)
First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Fast implementation of DGEMM on Fermi GPU

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
MR3-SMP: A symmetric tridiagonal eigensolver for multi-core architectures

Parallel Computing
Goal-Oriented and Modular Stability Analysis

SIAM Journal on Matrix Analysis and Applications
Computing the Action of the Matrix Exponential, with an Application to Exponential Integrators

SIAM Journal on Scientific Computing
Conditioning and error estimation in the numerical solution of matrix riccati equations

NAA'04 Proceedings of the Third international conference on Numerical Analysis and its Applications
HeteroMPI+ScaLAPACK: towards a ScaLAPACK (dense linear solvers) on heterogeneous networks of computers

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Using explicit platform descriptions to support programming of heterogeneous many-core systems

Parallel Computing
Network bandwidth measurements and ratio analysis with the HPC challenge benchmark suite (HPCC)

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Journal of Parallel and Distributed Computing
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
High performance matrix inversion based on LU factorization for multicore architectures

Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
Parallelising matrix operations on clusters for an optimal control-based quantum compiler

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Partial spectral information from linear systems to speed-up numerical simulations in computational fluid dynamics

VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science
Empirical performance-model driven data layout optimization

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Comparison of different parallel modified gram-schmidt algorithms

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Automatic tuning of PDGEMM towards optimal performance

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
High performance linear algebra algorithms: an introduction

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
A matrix-type for performance–portability

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Rapid development of high-performance linear algebra libraries

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Efficient execution of scientific computation on geographically distributed clusters

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Parallel algorithms for the determination of lyapunov characteristics of large nonlinear dynamical systems

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Parallelization of general matrix multiply routines using OpenMP

WOMPAT'04 Proceedings of the 5th international conference on OpenMP Applications and Tools: shared Memory Parallel Programming with OpenMP
An implementation of the matrix multiplication algorithm SUMMA in mpf

PaCT'05 Proceedings of the 8th international conference on Parallel Computing Technologies
A static parallel multifrontal solver for finite element meshes

ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Cache blocking

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
The algorithm of multiple relatively robust representations for multi-core processors

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Upper and lower I/O bounds for pebbling r-pyramids

Journal of Discrete Algorithms
High performance BLAS formulation of the adaptive Fast Multipole Method

Mathematical and Computer Modelling: An International Journal
An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Proceedings of the 26th ACM international conference on Supercomputing
Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures

Concurrency and Computation: Practice & Experience
Optimizing linpack benchmark on GPU-accelerated petascale supercomputer

Journal of Computer Science and Technology - Special issue on Community Analysis and Information Recommendation
The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations

Journal of Parallel and Distributed Computing
Programming many-core architectures - a case study: dense matrix computations on the Intel single-chip cloud computer processor

Concurrency and Computation: Practice & Experience
CUDAICA: GPU optimization of infomax-ICA EEG analysis

Computational Intelligence and Neuroscience - Special issue on Advanced Computational Techniques and Tools for Neuroscience
New level-3 BLAS kernels for cholesky factorization

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Cache blocking for linear algebra algorithms

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Generalizing matrix multiplication for efficient computations on modern computers

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Auto-tuning dense vector and matrix-vector operations for fermi GPUs

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modeling

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Families of Algorithms for Reducing a Matrix to Condensed Form

ACM Transactions on Mathematical Software (TOMS)
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Toward scalable matrix multiply on multithreaded architectures

Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
A novel algorithm of optimal matrix partitioning for parallel dense factorization on heterogeneous processors

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Layout-oblivious compiler optimization for matrix computations

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Experiments in parallel matrix multiplication on multi-core systems

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
High-Performance matrix multiply on a massively multithreaded fiteng1000 processor

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Fast Likelihood Computation in Speech Recognition using Matrices

Journal of Signal Processing Systems
Efficient generalized Hessenberg form and applications

ACM Transactions on Mathematical Software (TOMS)
Performance modeling of pipelined linear algebra architectures on FPGAs

ARC'13 Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications
Cache-conscious performance optimization for similarity search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Scaling LAPACK panel operations using parallel cache assignment

ACM Transactions on Mathematical Software (TOMS)
Cache efficient implementation for block matrix operations

Proceedings of the High Performance Computing Symposium
Interfaces are key

SE-HPCCSE '13 Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
A case study in mechanically deriving dense linear algebra code

International Journal of High Performance Computing Applications
Application-tailored linear algebra algorithms: A search-based approach

International Journal of High Performance Computing Applications
VBARMS: A variable block algebraic recursive multilevel solver for sparse linear systems

Journal of Computational and Applied Mathematics
A Basic Linear Algebra Compiler

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Tile size selection revisited

ACM Transactions on Architecture and Code Optimization (TACO)
Scheduler vulnerabilities and coordinated attacks in cloud computing

Journal of Computer Security
Performance models and workload distribution algorithms for optimizing a hybrid CPU-GPU multifrontal solver

Computers & Mathematics with Applications

Abstract

This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions, known as the Level 3 BLAS, are targeted at matrix-matrix operations that should provide for efficient and portable implementations of algorithms for high-performance computers.