Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the production of an abundance of anfractuous code that is difficult to maintain and almost impossible to upgrade.

Having struggled with these issues for more than a decade, we have concluded that a solution is to apply a technique from theoretical computer science, formal derivation, to the development of high-performance linear algebra libraries. We think the resulting approach yields aesthetically pleasing, coherent code that greatly facilitates intelligent modularity and high performance while enhancing confidence in its correctness. Since the technique is language-independent, it lends itself equally well to a wide spectrum of programming languages (and paradigms), ranging from C and Fortran to C++ and Java. In this paper, we illustrate our observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures. This environment demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.

We present performance experiments on the Intel Pentium III processor that demonstrate that high performance can be attained by coding at a high level of abstraction.