FLAME: Formal Linear Algebra Methods Environment

Authors:
John A. Gunnels;Fred G. Gustavson;Greg M. Henry;Robert A. van de Geijn
Affiliations:
The University of Texas at Austin, Yorktown Heights, NY;IBM T.J. Watson Research Center, Yorktown Heights, NY;Intel Corporation, Hillsboro, OR;The University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2001

Citing 26
Cited 52

An extended set of FORTRAN basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms

ACM Transactions on Mathematical Software (TOMS)
LAPACK's user's guide

LAPACK's user's guide
A logical approach to discrete math

A logical approach to discrete math
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

IBM Journal of Research and Development
Using PLAPACK: parallel linear algebra package

Using PLAPACK: parallel linear algebra package
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
Basic Linear Algebra Subprograms for Fortran Usage

ACM Transactions on Mathematical Software (TOMS)
Matrix algorithms

Matrix algorithms
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
MPI: The Complete Reference

MPI: The Complete Reference
Solving Linear Systems on Vector and Shared Memory Computers

Solving Linear Systems on Vector and Shared Memory Computers
A Family of High-Performance Matrix Multiplication Algorithms

ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel Out-of-Core Cholesky and QR Factorization with POOCLAPACK

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Superscalar GEMM-based Level 3 BLAS - The On-going Evolution of a Portable and High-Performance Library

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms

PARA '98 Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems
New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms

Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
Formal Methods for High-Performance Linear Algebra Libraries

Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
A Flexible Class of Parallel Matrix Multiplication Algorithms

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Efficient Parallel Out-of-Core Implementation of the Cholesky Factorization

Efficient Parallel Out-of-Core Implementation of the Cholesky Factorization
POOCLAPACK: Parallel Out-of-Core Linear Algebra Package

POOCLAPACK: Parallel Out-of-Core Linear Algebra Package
Developing Linear Algebra Algorithms: A Collection of Class Projects

Developing Linear Algebra Algorithms: A Collection of Class Projects
A systematic approach to the design and analysis of linear algebra algorithms

A systematic approach to the design and analysis of linear algebra algorithms
Applying recursion to serial and parallel QR factorization leads to better performance

IBM Journal of Research and Development
Minimal-storage high-performance Cholesky factorization via blocking and recursion

IBM Journal of Research and Development

Formal derivation of algorithms: The triangular sylvester equation

ACM Transactions on Mathematical Software (TOMS)
A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
The science of deriving dense linear algebra algorithms

ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces

ACM Transactions on Mathematical Software (TOMS)
Parallel out-of-core computation and updating of the QR factorization

ACM Transactions on Mathematical Software (TOMS)
Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Statistical Models for Empirical Search-Based Performance Tuning

International Journal of High Performance Computing Applications
Program generation for the all-pairs shortest path problem

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
FFT program generation for shared memory: SMP and multicore

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Adaptive Strassen's matrix multiplication

Proceedings of the 21st annual international conference on Supercomputing
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software (TOMS)
High-performance implementation of the level-3 BLAS

ACM Transactions on Mathematical Software (TOMS)
Solving Dense Linear Systems on Graphics Processors

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
How to Write Fast Numerical Code: A Small Introduction

Generative and Transformational Techniques in Software Engineering II
Adaptive Winograd's matrix multiplications

ACM Transactions on Mathematical Software (TOMS)
Solving dense linear systems on platforms with multiple hardware accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming matrix algorithms-by-blocks for thread-level parallelism

ACM Transactions on Mathematical Software (TOMS)
PetaBricks: a language and compiler for algorithmic choice

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
A Note on Auto-tuning GEMM for GPUs

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Automating the generation of composed linear algebra kernels

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

IBM Journal of Research and Development
Prospectus for the next LAPACK and ScaLAPACK libraries

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Minimal data copy for dense linear algebra factorization

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Speeding up Nek5000 with autotuning and specialization

Proceedings of the 24th ACM International Conference on Supercomputing
Managing the complexity of lookahead for LU factorization with pivoting

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
New abstractions for data parallel programming

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
Using hybrid CPU-GPU platforms to accelerate the computation of the matrix sign function

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Parallel memory prediction for fused linear algebra kernels

ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Solving dense interval linear systems with verified computing on multicore architectures

VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Designing and dynamically load balancing hybrid LU for multi/many-core

Computer Science - Research and Development
Efficient model order reduction of large-scale systems on multi-core platforms

ICCSA'11 Proceedings of the 2011 international conference on Computational science and Its applications - Volume Part V
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

ACM Transactions on Mathematical Software (TOMS)
Unified Embedded Parallel Finite Element Computations via Software-Based Fréchet Differentiation

SIAM Journal on Scientific Computing
Goal-Oriented and Modular Stability Analysis

SIAM Journal on Matrix Analysis and Applications
Applying software testing metrics to lapack

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
A matrix-type for performance–portability

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Rapid development of high-performance linear algebra libraries

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Automatic derivation of linear algebra algorithms with application to control theory

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Rapid development of high-performance out-of-core solvers

PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
MadLINQ: large-scale distributed matrix computation for the cloud

Proceedings of the 7th ACM european conference on Computer Systems
Synthesising graphics card programs from DSLs

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures

ACM Transactions on Mathematical Software (TOMS)
Programming many-core architectures - a case study: dense matrix computations on the Intel single-chip cloud computer processor

Concurrency and Computation: Practice & Experience
Families of Algorithms for Reducing a Matrix to Condensed Form

ACM Transactions on Mathematical Software (TOMS)
High-Performance matrix multiply on a massively multithreaded fiteng1000 processor

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Elemental: A New Framework for Distributed Memory Dense Matrix Computations

ACM Transactions on Mathematical Software (TOMS)
Spiral in scala: towards the systematic construction of generators for performance libraries

Proceedings of the 12th international conference on Generative programming: concepts & experiences
Interfaces are key

SE-HPCCSE '13 Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
A Basic Linear Algebra Compiler

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the production of an abundance of anfractuous code that is difficult to maintain and almost impossible to upgrade.Having struggled with these issues for more than a decade, we have concluded that a solution is to apply a technique from theoretical computer science, formal derivation, to the development of high-performance linear algebra libraries. We think the resulting approach results in aesthetically pleasing, coherent code that greatly facilitates intelligent modularity and high performance while enhancing confidence in its correctness. Since the technique is language-independent, it lends itself equally well to a wide spectrum of programming languages (and paradigms) ranging from C and Fortran to C++ and Java. In this paper, we illustrate our observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures. This environment demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world.We present performance experiments on the Intel (R) Pentium (R) III processor that demonstrate that high performance can be attained by coding at a high level of abstraction.