Vector and parallel algorithms for Cholesky factorization on IBM 3090
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Using PLAPACK: parallel linear algebra package
Using PLAPACK: parallel linear algebra package
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Recursive Array Layouts and Fast Matrix Multiplication
IEEE Transactions on Parallel and Distributed Systems
MaTRiX+/sup +/: an object-oriented environment for parallel high-performance matrix computations
HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences
BLAS Based on Block Data Structures
BLAS Based on Block Data Structures
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
The science of deriving dense linear algebra algorithms
ACM Transactions on Mathematical Software (TOMS)
Representing linear algebra algorithms in code: the FLAME application program interfaces
ACM Transactions on Mathematical Software (TOMS)
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
CellSs: a programming model for the cell BE architecture
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
OpenMP issues arising in the development of parallel BLAS and LAPACK libraries
Scientific Programming - OpenMP
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Mechanical derivation and systematic analysis of correct linear algebra algorithms
Mechanical derivation and systematic analysis of correct linear algebra algorithms
Scalable parallelization of FLAME code via the workqueuing model
ACM Transactions on Mathematical Software (TOMS)
Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures
PDP '08 Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Families of algorithms related to the inversion of a Symmetric Positive Definite matrix
ACM Transactions on Mathematical Software (TOMS)
Satisfying your dependencies with SuperMatrix
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Three algorithms for Cholesky factorization on distributed memory using packed storage
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Solving dense linear systems on platforms with multiple hardware accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization
High Performance Computing for Computational Science - VECPAR 2008
Programming matrix algorithms-by-blocks for thread-level parallelism
ACM Transactions on Mathematical Software (TOMS)
Managing the complexity of lookahead for LU factorization with pivoting
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Shared Register File Based ILP for Multicore
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Solving dense interval linear systems with verified computing on multicore architectures
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
DAGuE: A generic distributed DAG engine for High Performance Computing
Parallel Computing
Journal of Parallel and Distributed Computing
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)
Proceedings of the 51st ACM Southeast Conference
Towards a functional run-time for dense NLA domain
Proceedings of the 2nd ACM SIGPLAN workshop on Functional high-performance computing
Dynamic load balancing on heterogeneous multi-GPU systems
Computers and Electrical Engineering
Hi-index | 0.00 |
This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.