We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
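The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea of blocking a matrix-matrix multiplication for a machine with multilevel memory. It is not the GotoBLAS implementation. The function name `blocked_matmul`, the block-size parameters `mc`, `kc`, and `nc`, and the loop order are all assumptions chosen for illustration; a real library would tune the block sizes to the cache hierarchy and use an optimized inner kernel instead of plain Python loops.

```python
def blocked_matmul(A, B, mc=2, kc=2, nc=2):
    """Compute C = A * B with a blocked loop nest (illustrative sketch only).

    A is an m x k matrix and B is a k x n matrix, both given as lists of rows.
    mc, kc, nc are hypothetical block sizes; they are not tuned values.
    """
    m, k = len(A), len(A[0])
    if len(B) != k:
        raise ValueError("inner dimensions of A and B do not match")
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]

    for pc in range(0, k, kc):              # step over the shared dimension
        p_end = min(pc + kc, k)
        for ic in range(0, m, mc):          # step over row blocks of A
            i_end = min(ic + mc, m)
            # Copy the current block of A into its own small buffer.
            # In a compiled implementation this copy would let the block
            # stay in fast memory while it is reused across all of B.
            a_block = [A[i][pc:p_end] for i in range(ic, i_end)]
            for jc in range(0, n, nc):      # step over column blocks of B
                j_end = min(jc + nc, n)
                # Small "inner kernel": update the mc x nc tile of C.
                for i, a_row in enumerate(a_block):
                    c_row = C[ic + i]
                    for j in range(jc, j_end):
                        acc = c_row[j]
                        for p, a in enumerate(a_row):
                            acc += a * B[pc + p][j]
                        c_row[j] = acc
    return C
```

The point of the sketch is the data reuse pattern: each block of `A` is copied once and then reused against every column block of `B`, which is the kind of locality a multilevel-memory model is meant to reward. Block sizes that do not divide the matrix dimensions are handled by clamping each block at the matrix edge.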