Vector and parallel algorithms for Cholesky factorization on IBM 3090
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Impact of sharing-based thread placement on multithreaded architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Efficient matrix computations through hierarchical type specifications
Efficient matrix computations through hierarchical type specifications
Compiler and software distributed shared memory support for irregular applications
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
A recursive formulation of Cholesky factorization of a matrix in packed storage
ACM Transactions on Mathematical Software (TOMS)
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
Compiler Analysis for Irregular Problems in Fortran D
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
MaTRiX+/sup +/: an object-oriented environment for parallel high-performance matrix computations
HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences
BLAS Based on Block Data Structures
BLAS Based on Block Data Structures
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
Parallel and fully recursive multifrontal sparse Cholesky
Future Generation Computer Systems - Special issue: Selected numerical algorithms
The science of deriving dense linear algebra algorithms
ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces
ACM Transactions on Mathematical Software (TOMS)
OpenMP issues arising in the development of parallel BLAS and LAPACK libraries
Scientific Programming - OpenMP
An experimental comparison of cache-oblivious and cache-conscious programs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
An experimental comparison of cache-oblivious and cache-conscious programs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Provably good multicore cache performance for divide-and-conquer algorithms
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Distributed SBP Cholesky factorization algorithms with near-optimal scheduling
ACM Transactions on Mathematical Software (TOMS)
Solving dense linear systems on platforms with multiple hardware accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization
High Performance Computing for Computational Science - VECPAR 2008
Design, Tuning and Evaluation of Parallel Multilevel ILU Preconditioners
High Performance Computing for Computational Science - VECPAR 2008
CellSs: Scheduling techniques to better exploit memory hierarchy
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Programming matrix algorithms-by-blocks for thread-level parallelism
ACM Transactions on Mathematical Software (TOMS)
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
Proceedings of the 23rd international conference on Supercomputing
Hierarchical Task-Based Programming With StarSs
International Journal of High Performance Computing Applications
Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Applying the concurrent collections programming model to asynchronous parallel dense linear algebra
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Rectangular full packed format for cholesky's algorithm: factorization, solution, and inversion
ACM Transactions on Mathematical Software (TOMS)
Parallel tiled QR factorization for multicore architectures
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Fine tuning matrix multiplications on multicore
HiPC'08 Proceedings of the 15th international conference on High performance computing
Managing the complexity of lookahead for LU factorization with pivoting
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Journal of Computational and Applied Mathematics
High-performance up-and-downdating via householder-like transformations
ACM Transactions on Mathematical Software (TOMS)
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Concurrency and Computation: Practice & Experience
ACM Transactions on Mathematical Software (TOMS)
Journal of Parallel and Distributed Computing
Cache blocking for linear algebra algorithms
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
High-Performance matrix multiply on a massively multithreaded fiteng1000 processor
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part II
Elemental: A New Framework for Distributed Memory Dense Matrix Computations
ACM Transactions on Mathematical Software (TOMS)
Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)
An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Analysis of dependence tracking algorithms for task dataflow execution
ACM Transactions on Architecture and Code Optimization (TACO)
Scalable matrix decompositions with multiple cores on FPGAs
Microprocessors & Microsystems
Hi-index | 0.00 |
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. Performance results on a 16 CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.