The emergence and continuing use of multi-core architectures require changes to existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now-prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multi-core architectures. In this paper we present a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), against new approaches to parallel execution (Task Based Linear Algebra Subroutines, TBLAS), and against equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on one-sided linear algebra factorizations (LU, QR and Cholesky) on multi-core architectures based on Intel Xeon EM64T and IBM Power6. For instance, a performance improvement of 67% was obtained on the Cholesky factorization of a matrix of order 4000 using 32 cores.