This paper presents a simple but highly effective approach for transforming high-performance implementations of matrix-matrix multiplication on cache-based architectures into implementations of the other commonly used matrix-matrix computations (the level-3 BLAS). Exceptional performance is demonstrated on various architectures.
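To make the idea of building a level-3 BLAS operation on top of matrix-matrix multiplication concrete, the following is a minimal sketch, not the implementation described in the paper. It computes a SYMM-style product C = A·B, where A is symmetric and only its lower triangle is stored, by copying each block of A into a dense buffer and handing that buffer to a general multiply routine. The function names (`gemm_acc`, `symm_block`, `symm_via_gemm`), the block size `nb`, and the naive triple loop standing in for a tuned GEMM kernel are all assumptions made for illustration.

```python
def gemm_acc(A, B, C, row0):
    """C[row0 + i][j] += sum_p A[i][p] * B[p][j].

    A naive stand-in for a tuned GEMM kernel (hypothetical helper).
    """
    for i, a_row in enumerate(A):
        c_row = C[row0 + i]
        for p, a_ip in enumerate(a_row):
            b_row = B[p]
            for j in range(len(c_row)):
                c_row[j] += a_ip * b_row[j]


def symm_block(L, i0, i1, j0, j1):
    """Return the dense block A[i0:i1, j0:j1] of the symmetric matrix A.

    Only entries L[r][c] with r >= c (the lower triangle) are read;
    entries above the diagonal are obtained by symmetry.
    """
    return [[L[r][c] if r >= c else L[c][r] for c in range(j0, j1)]
            for r in range(i0, i1)]


def symm_via_gemm(L, B, nb=2):
    """Compute C = A @ B, with A symmetric and stored in the lower triangle of L."""
    n, m = len(L), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        for j0 in range(0, n, nb):
            j1 = min(j0 + nb, n)
            # Copy step: expand the stored triangle into a dense block,
            # so the multiply routine never needs to know A is symmetric.
            A_blk = symm_block(L, i0, i1, j0, j1)
            gemm_acc(A_blk, B[j0:j1], C, i0)
    return C
```

In this sketch all knowledge of the symmetric storage format lives in the block-copy step, so the multiply routine is reused unchanged; whether a production library organizes the work this way is not something the abstract states.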