The use of BLAS3 in linear algebra on a parallel processor with a hierarchical memory
SIAM Journal on Scientific and Statistical Computing
The WY representation for products of Householder matrices
SIAM Journal on Scientific and Statistical Computing - Papers from the Second Conference on Parallel Processing for Scientific Computin
An extended set of FORTRAN basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
ACM Transactions on Mathematical Software (TOMS)
Block reflectors: theory and computation
SIAM Journal on Numerical Analysis
ACM Transactions on Mathematical Software (TOMS)
Basic Linear Algebra Subprograms for Fortran Usage
ACM Transactions on Mathematical Software (TOMS)
Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage [F1]
ACM Transactions on Mathematical Software (TOMS)
Solving Large Full Sets of Linear Equations in a Paged Virtual Store
ACM Transactions on Mathematical Software (TOMS)
Organizing matrices and matrix operations for paged memory systems
Communications of the ACM
Advanced Architecture Computers
Advanced Architecture Computers
Issues relating to extension of the Basic Linear Algebra Subprograms
ACM SIGNUM Newsletter
ACM Transactions on Mathematical Software (TOMS)
Exploiting fast matrix multiplication within the level 3 BLAS
ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
LAPACK: a portable linear algebra library for high-performance computers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Parallel algorithm research at CERFACS
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Computer Architecture in the 1990s
Computer
A new approach for automatic parallelization of blocked linear Algebra computations
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Stability of block algorithms with fast level-3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Automatic data mapping for distributed-memory parallel computers
ICS '92 Proceedings of the 6th international conference on Supercomputing
PYRROS: static task scheduling and code generation for message passing multiprocessors
ICS '92 Proceedings of the 6th international conference on Supercomputing
On the parallelization of blocked LU factorization algorithms on distributed memory architectures
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Computing selected eigenvalues of sparse unsymmetric matrices using subspace iteration
ACM Transactions on Mathematical Software (TOMS)
Parallel direct solution of large sparse systems in finite element computations
ICS '93 Proceedings of the 7th international conference on Supercomputing
A proposal of Level 3 interface for band and skyline matrix factorization subroutine
ICS '93 Proceedings of the 7th international conference on Supercomputing
The role of APL and J in high-performance computation
APL '93 Proceedings of the international conference on APL
Toward parallel mathematical software for elliptic partial differential equations
ACM Transactions on Mathematical Software (TOMS)
Introducing a New Cache Design into Vector Computers
IEEE Transactions on Computers
A parallel block implementation of Level-3 BLAS for MIMD vector processors
ACM Transactions on Mathematical Software (TOMS)
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
IBM Journal of Research and Development
Algorithm 741: least-squares solution of a linear, bordered, block-diagonal system of equations
ACM Transactions on Mathematical Software (TOMS)
An Arnoldi code for computing selected eigenvalues of sparse, real, unsymmetric matrices
ACM Transactions on Mathematical Software (TOMS)
The design of a new frontal code for solving sparse, unsymmetric systems
ACM Transactions on Mathematical Software (TOMS)
ACM Transactions on Mathematical Software (TOMS)
The design of MA48: a code for the direct solution of sparse unsymmetric linear systems of equations
ACM Transactions on Mathematical Software (TOMS)
ACM Transactions on Mathematical Software (TOMS)
Parallel reduction of banded matrices to bidiagonal form
Parallel Computing
Tuning the performance of I/O-intensive parallel applications
Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference
Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference
Design and evaluation of dynamic access ordering hardware
ICS '96 Proceedings of the 10th international conference on Supercomputing
Programming language requirements for the next millennium
ACM Computing Surveys (CSUR) - Special issue: position statements on strategic directions in computing research
ICS '90 Proceedings of the 4th international conference on Supercomputing
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Practical experience in the numerical dangers of heterogeneous computing
ACM Transactions on Mathematical Software (TOMS)
Compiler blockability of dense matrix factorizations
ACM Transactions on Mathematical Software (TOMS)
Level 3 basic linear algebra subprograms for sparse matrices: a user-level interface
ACM Transactions on Mathematical Software (TOMS)
A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems
IEEE Transactions on Parallel and Distributed Systems
ACM Transactions on Mathematical Software (TOMS)
The automatic generation of sparse primitives
ACM Transactions on Mathematical Software (TOMS)
An object-oriented framework for block preconditioning
ACM Transactions on Mathematical Software (TOMS)
A combined unifrontal/multifrontal method for unsymmetric sparse matrices
ACM Transactions on Mathematical Software (TOMS)
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark
ACM Transactions on Mathematical Software (TOMS)
Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues
ACM Transactions on Mathematical Software (TOMS)
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
IEEE Transactions on Parallel and Distributed Systems
Nonlinear array layouts for hierarchical memory systems
ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors
ACM Transactions on Mathematical Software (TOMS)
Direct numerical simulation of turbulence with a PC/Linux cluster: fact or fiction?
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
An annotation language for optimizing software libraries
Proceedings of the 2nd conference on Domain-specific languages
ACM Transactions on Mathematical Software (TOMS)
Design and Performance Evaluation of a Portable Parallel Library for Space-Time Adaptive Processing
IEEE Transactions on Parallel and Distributed Systems
Hardware-only stream prefetching and dynamic access ordering
Proceedings of the 14th international conference on Supercomputing
Task scheduling using a block dependency DAG for block-oriented sparse Cholesky factorization
SAC '00 Proceedings of the 2000 ACM symposium on Applied computing - Volume 2
ACM Transactions on Mathematical Software (TOMS)
IEEE Transactions on Parallel and Distributed Systems
OoLALA: an object oriented analysis and design of numerical linear algebra
OOPSLA '00 Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
PSBLAS: a library for parallel linear algebra computation on sparse matrices
ACM Transactions on Mathematical Software (TOMS)
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Implementation of Strassen's algorithm for matrix multiplication
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
NetSolve: a network server for solving computational science problems
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
A framework for sparse matrix code synthesis from high-level specifications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Dynamic Access Ordering for Streamed Computations
IEEE Transactions on Computers
Automatic translation of Fortran to JVM bytecode
Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande
A recursive formulation of Cholesky factorization of a matrix in packed storage
ACM Transactions on Mathematical Software (TOMS)
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Tuning Strassen's matrix multiplication for memory efficiency
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Optimization of a parallel ocean general circulation model
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
PLAPACK: parallel linear algebra package design overview
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
The Journal of Supercomputing
Register tiling in nonrectangular iteration spaces
ACM Transactions on Programming Languages and Systems (TOPLAS)
An updated set of basic linear algebra subprograms (BLAS)
ACM Transactions on Mathematical Software (TOMS)
Design, implementation and testing of extended and mixed precision BLAS
ACM Transactions on Mathematical Software (TOMS)
ACM Transactions on Mathematical Software (TOMS)
Algorithm 818: A reference model implementation of the sparse BLAS in Fortran 95
ACM Transactions on Mathematical Software (TOMS)
Preface to the special issue on the basic linear algebra subprograms (BLAS)
ACM Transactions on Mathematical Software (TOMS)
Generic programming for high performance scientific applications
JGI '02 Proceedings of the 2002 joint ACM-ISCOPE conference on Java Grande
Implementing Hager's exchange methods for matrix profile reduction
ACM Transactions on Mathematical Software (TOMS)
ACM Transactions on Mathematical Software (TOMS)
Queueing Systems: Theory and Applications
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Parallel algorithms for LQ optimal control of discrete-time periodic linear systems
Journal of Parallel and Distributed Computing
Linear Algebra Libraries for High-Performance Computers: A Personal Perspective
IEEE Parallel & Distributed Technology: Systems & Technology
The Matrix Template Library: Generic Components for High-Performance Scientific Computing
Computing in Science and Engineering
The Decompositional Approach to Matrix Computation
Computing in Science and Engineering
Faster Numerical Algorithms Via Exception Handling
IEEE Transactions on Computers
On the Granularity and Clustering of Directed Acyclic Task Graphs
IEEE Transactions on Parallel and Distributed Systems
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
Generation of Injective and Reversible Modular Mappings
IEEE Transactions on Parallel and Distributed Systems
A Family of High-Performance Matrix Multiplication Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Statistical Models for Automatic Performance Tuning
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Cluster Configuration Aided by Simulation
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel and Fully Recursive Multifrontal Supernodal Sparse Cholesky
ICCS '02 Proceedings of the International Conference on Computational Science-Part II
LAWRA Workshop: Linear Algebra with Recursive Algorithms: http://lawra.uni-c.dk/lawra/
HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Parallel Out-of-Core Cholesky and QR Factorization with POOCLAPACK
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
PARA '02 Proceedings of the 6th International Conference on Applied Parallel Computing Advanced Scientific Computing
Code Generators for Automatic Tuning of Numerical Kernels: Experiences with FFTW
SAIG '00 Proceedings of the International Workshop on Semantics, Applications, and Implementation of Program Generation
ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
Using Pentangular Factorizations for the Reduction to Banded Form
Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments
An Evaluation of Java for Numerical Computing
ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments
ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Blocking Techniques in Numerical Software
ParNum '99 Proceedings of the 4th International ACPC Conference Including Special Tracks on Parallel Numerics and Parallel Computing in Image Processing, Video Processing, and Multimedia: Parallel Computation
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
A Performance Study on a Single Processing Node of the HITACHI SR8000
NAA '00 Revised Papers from the Second International Conference on Numerical Analysis and Its Applications
Recursive Version of LU Decomposition
NAA '00 Revised Papers from the Second International Conference on Numerical Analysis and Its Applications
A new data-mapping scheme for latency-tolerant distributed sparse triangular solution
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Advanced environments for parallel and distributed applications: a view of current status
Parallel Computing - Special issue: Advanced environments for parallel and distributed computing
On parallel block algorithms for exact triangularizations
Parallel Computing
Formal derivation of algorithms: The triangular Sylvester equation
ACM Transactions on Mathematical Software (TOMS)
Finite field linear algebra subroutines
Proceedings of the 2002 international symposium on Symbolic and algebraic computation
NetSolve: A Network-Enabled Solver: Examples and Users
HCW '98 Proceedings of the Seventh Heterogeneous Computing Workshop
A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Commodity Clusters: Performance Comparison Between PC's and Workstations
HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
A Flexible Class of Parallel Matrix Multiplication Algorithms
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Caching-Efficient Multithreaded Fast Multiplication of Sparse Matrices
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Linear algebra operators for GPU implementation of numerical algorithms
ACM SIGGRAPH 2003 Papers
Mathematical software: past, present, and future
Computational science, mathematics and software
Numerical algorithm delivery mechanisms
Computational science, mathematics and software
Sourcebook of parallel computing
Parallel frontal solvers for large sparse linear systems
ACM Transactions on Mathematical Software (TOMS)
Matrix bidiagonalization: implementation and evaluation on the Trident processor
Neural, Parallel & Scientific Computations
Self-adapting software for numerical linear algebra and LAPACK for clusters
Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems
IEEE Transactions on Computers
Surface reconstruction based on compactly supported radial basis functions
Geometric modeling
A data locality optimizing algorithm
ACM SIGPLAN Notices - Best of PLDI 1979-1999
A parallel direct solver for large sparse highly unsymmetric linear systems
ACM Transactions on Mathematical Software (TOMS)
MA57 - a code for the solution of sparse symmetric definite and indefinite systems
ACM Transactions on Mathematical Software (TOMS)
A column pre-ordering strategy for the unsymmetric-pattern multifrontal method
ACM Transactions on Mathematical Software (TOMS)
Parallel and fully recursive multifrontal sparse Cholesky
Future Generation Computer Systems - Special issue: Selected numerical algorithms
High-performance linear algebra algorithms using new generalized data structures for matrices
IBM Journal of Research and Development
ACM Transactions on Mathematical Software (TOMS)
A column approximate minimum degree ordering algorithm
ACM Transactions on Mathematical Software (TOMS)
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 2nd international conference on Service oriented computing
64-bit floating-point FPGA matrix multiplication
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Fast SVM Training Algorithm with Decomposition on Very Large Data Sets
IEEE Transactions on Pattern Analysis and Machine Intelligence
Supporting Cluster-Based Network Services on Functionally Symmetric Software Architecture
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Early Evaluation of the Cray X1
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
The science of deriving dense linear algebra algorithms
ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces
ACM Transactions on Mathematical Software (TOMS)
Parallel out-of-core computation and updating of the QR factorization
ACM Transactions on Mathematical Software (TOMS)
Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A fully portable high performance minimal storage hybrid format Cholesky algorithm
ACM Transactions on Mathematical Software (TOMS)
Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions
Journal of VLSI Signal Processing Systems
Software libraries, numerical and statistical
Encyclopedia of Computer Science
High Performance Computing Systems for Autonomous Spaceborne Missions
International Journal of High Performance Computing Applications
Statistical Models for Empirical Search-Based Performance Tuning
International Journal of High Performance Computing Applications
High Performance Linear Algebra Operations on Reconfigurable Systems
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
International Journal of High Performance Computing Applications
Building the functional performance model of a processor
Proceedings of the 2006 ACM symposium on Applied computing
Accumulating Householder transformations, revisited
ACM Transactions on Mathematical Software (TOMS)
Improving the performance of reduction to Hessenberg form
ACM Transactions on Mathematical Software (TOMS)
Optimizing FIAT with level 3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Algorithm 854: Fortran 77 subroutines for computing the eigenvalues of Hamiltonian matrices II
ACM Transactions on Mathematical Software (TOMS)
Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
Fast additions on masked integers
ACM SIGPLAN Notices
An object-oriented framework for the development of scalable parallel multilevel preconditioners
ACM Transactions on Mathematical Software (TOMS)
Analyzing block locality in Morton-order and Morton-hybrid matrices
MEDEA '06 Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures
Deployment of parallel direct sparse linear solvers within a parallel finite element code
PDCN'06 Proceedings of the 24th IASTED international conference on Parallel and distributed computing and networks
Block algorithms for reordering standard and generalized Schur forms
ACM Transactions on Mathematical Software (TOMS)
The design and implementation of the MRRR algorithm
ACM Transactions on Mathematical Software (TOMS)
SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Linear algebra operators for GPU implementation of numerical algorithms
SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Algorithm 865: Fortran 95 subroutines for Cholesky factorization in block hybrid format
ACM Transactions on Mathematical Software (TOMS)
Journal of Computational Physics
Data Partitioning with a Functional Performance Model of Heterogeneous Processors
International Journal of High Performance Computing Applications
ACM Transactions on Mathematical Software (TOMS)
An evaluation of Java for numerical computing
Scientific Programming
JLAPACK - compiling LAPACK Fortran to Java
Scientific Programming
Recursive approach in sparse matrix LU factorization
Scientific Programming
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An annotation language for optimizing software libraries
DSL'99 Proceedings of the 2nd Conference on Domain-Specific Languages - Volume 2
BLASTH, a BLAS library for dual SMP computer
ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
ACM Transactions on Mathematical Software (TOMS)
An operation stacking framework for large ensemble computations
Proceedings of the 21st annual international conference on Supercomputing
Certification of the QR factor R and of lattice basis reducedness
Proceedings of the 2007 international symposium on Symbolic and algebraic computation
Data structures for the distributed iterative solution of non-conventional finite element models
Advances in Engineering Software
High Performance Development for High End Computing With Python Language Wrapper (PLW)
International Journal of High Performance Computing Applications
Neural, Parallel & Scientific Computations
Block variants of Hammarling's method for solving Lyapunov equations
ACM Transactions on Mathematical Software (TOMS)
Parallel unsymmetric-pattern multifrontal sparse LU with column preordering
ACM Transactions on Mathematical Software (TOMS)
Scalable parallelization of FLAME code via the workqueuing model
ACM Transactions on Mathematical Software (TOMS)
Analyzing block locality in Morton-order and Morton-hybrid matrices
ACM SIGARCH Computer Architecture News
High performance BLAS formulation of the multipole-to-local operator in the fast multipole method
Journal of Computational Physics
High performance dense linear algebra on a spatially distributed processor
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Designing polylibraries to speed up linear algebra computations
International Journal of High Performance Computing and Networking
Teraflops Sustained Performance With Real World Applications
International Journal of High Performance Computing Applications
Server-based data push architecture for multi-processor environments
Journal of Computer Science and Technology
A highly efficient implementation of a backpropagation learning algorithm using matrix ISA
Journal of Parallel and Distributed Computing
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix
ACM Transactions on Mathematical Software (TOMS)
High-performance implementation of the level-3 BLAS
ACM Transactions on Mathematical Software (TOMS)
Effective and scalable software compatibility testing
ISSTA '08 Proceedings of the 2008 international symposium on Software testing and analysis
Dense Linear Algebra over Word-Size Prime Fields: the FFLAS and FFPACK Packages
ACM Transactions on Mathematical Software (TOMS)
Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate
ACM Transactions on Mathematical Software (TOMS)
Performance evaluation of supercomputers using HPCC and IMB Benchmarks
Journal of Computer and System Sciences
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Algorithmic performance studies on graphics processing units
Journal of Parallel and Distributed Computing
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
Performance Model for Parallel Mathematical Libraries Based on Historical Knowledgebase
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
A sparse nonsymmetric eigensolver for distributed memory architectures
International Journal of Parallel, Emergent and Distributed Systems
Dynamic Supernodes in Sparse Cholesky Update/Downdate and Triangular Solves
ACM Transactions on Mathematical Software (TOMS)
A mathematical model of the static pantograph/catenary interaction
International Journal of Computer Mathematics - RECENT ADVANCES IN COMPUTATIONAL AND APPLIED MATHEMATICS IN SCIENCE AND ENGINEERING
Adaptive Winograd's matrix multiplications
ACM Transactions on Mathematical Software (TOMS)
An out-of-core sparse Cholesky solver
ACM Transactions on Mathematical Software (TOMS)
Solving dense linear systems on platforms with multiple hardware accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Petascale computing with accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
High Performance Computing for Computational Science - VECPAR 2008
LAPACK-Based Condition Estimates for the Discrete-Time LQG Design
Numerical Analysis and Its Applications
Programming the Linpack benchmark for the IBM PowerXCell 8i processor
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Programming matrix algorithms-by-blocks for thread-level parallelism
ACM Transactions on Mathematical Software (TOMS)
Towards many-core implementation of LU decomposition using Peano Curves
Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop
Mapping the LU decomposition on a many-core architecture: challenges and solutions
Proceedings of the 6th ACM conference on Computing frontiers
C++ Bindings to External Software Libraries with Examples from BLAS, LAPACK, UMFPACK, and MUMPS
ACM Transactions on Mathematical Software (TOMS)
Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Evaluation of the SUN UltraSparc T2+ Processor for Computational Science
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
A Parallel Nonnegative Tensor Factorization Algorithm for Mining Global Climate Data
ICCS 2009 Proceedings of the 9th International Conference on Computational Science
Advanced service trading for scientific computing over the grid
The Journal of Supercomputing
Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Out-of-Core Computation of the QR Factorization on Multi-core Processors
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
On the Need for a Consortium of Capability Centers
International Journal of High Performance Computing Applications
ACM Transactions on Mathematical Software (TOMS)
Cache-optimal algorithms for option pricing
ACM Transactions on Mathematical Software (TOMS)
Run-time automatic instantiation of algorithms using C++ templates
International Journal of Computational Science and Engineering
Automating the generation of composed linear algebra kernels
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Liquid water: obtaining the right answer for the right reasons
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Blue Gene/L compute chip: control, test, and bring-up infrastructure
IBM Journal of Research and Development
Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L
IBM Journal of Research and Development
Standardized mixed language programming for Fortran and C
ACM SIGPLAN Fortran Forum
Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems
ACM Transactions on Mathematical Software (TOMS)
Rectangular full packed format for Cholesky's algorithm: factorization, solution, and inversion
ACM Transactions on Mathematical Software (TOMS)
Scaling and pivoting in an out-of-core sparse direct solver
ACM Transactions on Mathematical Software (TOMS)
Polymorphic architectures: from media processing to supercomputing
CompSysTech '09 Proceedings of the International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing
Algorithms for memory hierarchies: advanced lectures
Algorithms for memory hierarchies: advanced lectures
The impact of memory organization on the performance of matrix calculations
Parallel Computing
A fast parallel optimization for training support vector machine
MLDM'03 Proceedings of the 3rd international conference on Machine learning and data mining in pattern recognition
Semantic-based service trading: application to linear algebra
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Self-adapting software for numerical linear algebra library routines on clusters
ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III
Software development in the grid: the DAMIEN tool-set
ICCS'03 Proceedings of the 1st international conference on Computational science: Part I
Toward memory-efficient linear solvers
VECPAR'02 Proceedings of the 5th international conference on High performance computing for computational science
Operation Stacking for Ensemble Computations With Variable Convergence
International Journal of High Performance Computing Applications
Minimal data copy for dense linear algebra factorization
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
New data structures for matrices and specialized inner kernels: low overhead for high performance
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
A supernodal out-of-core sparse Gaussian-elimination method
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Performance evaluation of basic linear algebra subroutines on a matrix co-processor
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture
Proceedings of the 7th ACM international conference on Computing frontiers
Solving path problems on the GPU
Parallel Computing
Managing the complexity of lookahead for LU factorization with pivoting
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
ACM Transactions on Mathematical Software (TOMS)
Algorithm 907: KLU, A Direct Sparse Solver for Circuit Simulation Problems
ACM Transactions on Mathematical Software (TOMS)
Using hybrid CPU-GPU platforms to accelerate the computation of the matrix sign function
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
CFD parallel simulation using Getfem++ and MUMPS
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Deployment of a hierarchical middleware
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
The general matrix multiply-add operation on 2D torus
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A Global Convergence Proof for Cyclic Jacobi Methods with Block Rotations
SIAM Journal on Matrix Analysis and Applications
Partitioned Triangular Tridiagonalization
ACM Transactions on Mathematical Software (TOMS)
Solving Very Sparse Rational Systems of Equations
ACM Transactions on Mathematical Software (TOMS)
An analytical network performance model for SIMD processor CSX600 interconnects
Journal of Systems Architecture: the EUROMICRO Journal
DESOLA: An active linear algebra library using delayed evaluation and runtime code generation
Science of Computer Programming
Exact solutions to linear systems of equations using output sensitive lifting
ACM Communications in Computer Algebra
Adaptive Techniques for Improving the Performance of Incomplete Factorization Preconditioning
SIAM Journal on Scientific Computing
A Novel Parallel QR Algorithm for Hybrid Distributed Memory HPC Systems
SIAM Journal on Scientific Computing
Multifrontal computations on GPUs and their multi-core hosts
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Improving CSE software through reproducibility requirements
Proceedings of the 4th International Workshop on Software Engineering for Computational Science and Engineering
A domain-decomposing parallel sparse linear system solver
Journal of Computational and Applied Mathematics
Knowledge-based automatic generation of partitioned matrix expressions
CASC'11 Proceedings of the 13th international conference on Computer algebra in scientific computing
High-performance up-and-downdating via Householder-like transformations
ACM Transactions on Mathematical Software (TOMS)
Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization
ACM Transactions on Mathematical Software (TOMS)
Partial factorization of a dense symmetric indefinite matrix
ACM Transactions on Mathematical Software (TOMS)
A note on shifted Hessenberg systems and frequency response computation
ACM Transactions on Mathematical Software (TOMS)
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Fast implementation of DGEMM on Fermi GPU
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
MR3-SMP: A symmetric tridiagonal eigensolver for multi-core architectures
Parallel Computing
Goal-Oriented and Modular Stability Analysis
SIAM Journal on Matrix Analysis and Applications
Computing the Action of the Matrix Exponential, with an Application to Exponential Integrators
SIAM Journal on Scientific Computing
Conditioning and error estimation in the numerical solution of matrix Riccati equations
NAA'04 Proceedings of the Third international conference on Numerical Analysis and its Applications
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Network bandwidth measurements and ratio analysis with the HPC challenge benchmark suite (HPCC)
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Journal of Parallel and Distributed Computing
Compiler-optimized kernels: an efficient alternative to hand-coded inner kernels
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
High performance matrix inversion based on LU factorization for multicore architectures
Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
Parallelising matrix operations on clusters for an optimal control-based quantum compiler
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science
Empirical performance-model driven data layout optimization
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Comparison of different parallel modified Gram-Schmidt algorithms
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Automatic tuning of PDGEMM towards optimal performance
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
High performance linear algebra algorithms: an introduction
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
A matrix-type for performance–portability
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Rapid development of high-performance linear algebra libraries
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Efficient execution of scientific computation on geographically distributed clusters
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Parallelization of general matrix multiply routines using OpenMP
WOMPAT'04 Proceedings of the 5th international conference on OpenMP Applications and Tools: shared Memory Parallel Programming with OpenMP
An implementation of the matrix multiplication algorithm SUMMA in mpf
PaCT'05 Proceedings of the 8th international conference on Parallel Computing Technologies
A static parallel multifrontal solver for finite element meshes
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
The algorithm of multiple relatively robust representations for multi-core processors
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Upper and lower I/O bounds for pebbling r-pyramids
Journal of Discrete Algorithms
High performance BLAS formulation of the adaptive Fast Multipole Method
Mathematical and Computer Modelling: An International Journal
An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs
Proceedings of the 26th ACM international conference on Supercomputing
Concurrency and Computation: Practice & Experience
Optimizing linpack benchmark on GPU-accelerated petascale supercomputer
Journal of Computer Science and Technology - Special issue on Community Analysis and Information Recommendation
Journal of Parallel and Distributed Computing
Concurrency and Computation: Practice & Experience
CUDAICA: GPU optimization of infomax-ICA EEG analysis
Computational Intelligence and Neuroscience - Special issue on Advanced Computational Techniques and Tools for Neuroscience
New level-3 BLAS kernels for Cholesky factorization
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Cache blocking for linear algebra algorithms
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Generalizing matrix multiplication for efficient computations on modern computers
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Auto-tuning dense vector and matrix-vector operations for Fermi GPUs
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Families of Algorithms for Reducing a Matrix to Condensed Form
ACM Transactions on Mathematical Software (TOMS)
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Layout-oblivious compiler optimization for matrix computations
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Experiments in parallel matrix multiplication on multi-core systems
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
High-Performance matrix multiply on a massively multithreaded fiteng1000 processor
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Fast Likelihood Computation in Speech Recognition using Matrices
Journal of Signal Processing Systems
Efficient generalized Hessenberg form and applications
ACM Transactions on Mathematical Software (TOMS)
Performance modeling of pipelined linear algebra architectures on FPGAs
ARC'13 Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications
Cache-conscious performance optimization for similarity search
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)
Cache efficient implementation for block matrix operations
Proceedings of the High Performance Computing Symposium
SE-HPCCSE '13 Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
A case study in mechanically deriving dense linear algebra code
International Journal of High Performance Computing Applications
Application-tailored linear algebra algorithms: A search-based approach
International Journal of High Performance Computing Applications
VBARMS: A variable block algebraic recursive multilevel solver for sparse linear systems
Journal of Computational and Applied Mathematics
A Basic Linear Algebra Compiler
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
ACM Transactions on Architecture and Code Optimization (TACO)
Scheduler vulnerabilities and coordinated attacks in cloud computing
Journal of Computer Security
Computers & Mathematics with Applications
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-vector operations that should provide for efficient and portable implementations of algorithms for high-performance computers.