Massively parallel computers are evolving towards deeper levels of parallelism and memory hierarchy, and the ratio of the time required to transfer data, whether through the memory hierarchy or between compute units, to the time required to perform floating-point operations is increasing exponentially. Algorithms therefore face two challenges: they must exploit multiple levels of parallelism, and they must reduce communication both between the compute units at each level of the hierarchy of parallelism and between the levels of the memory hierarchy. In this paper we present an algorithm for computing the LU factorization of dense matrices that is suited to computer systems with two levels of parallelism. The algorithm minimizes both the volume of communication and the number of messages transferred at each level of the two-level hierarchy of parallelism. We describe its implementation for a cluster of multicore processors, based on MPI and Pthreads, and show that it performs better than the LU factorization routines in well-known numerical libraries. For tall and skinny matrices, that is, matrices with many more rows than columns, our algorithm outperforms the corresponding ScaLAPACK algorithm by a factor of 4.5 on a cluster of 32 nodes, each with two quad-core Intel Xeon EM64T processors.
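To make concrete what is being factored, the following is a minimal sequential sketch of the LU factorization of a tall and skinny matrix with classical partial pivoting, written in plain Python. It is not the algorithm presented in this paper: it has no parallelism and makes no attempt to reduce communication. It only illustrates the operation P A = L U that the parallel algorithms compute. The function name `lu_partial_pivoting` and the list-of-lists matrix representation are assumptions made for this illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lu_partial_pivoting(a: Matrix) -> Tuple[List[int], Matrix, Matrix]:
    """Factor an m x n matrix (m >= n) as P A = L U.

    Returns (perm, L, U), where perm[i] is the index of the row of A that
    ends up at row i of P A, L is m x n unit lower trapezoidal, and U is
    n x n upper triangular. Illustrative baseline only (hypothetical helper).
    """
    m, n = len(a), len(a[0])
    if m < n:
        raise ValueError("expected a tall matrix with m >= n")
    w = [row[:] for row in a]  # working copy, overwritten in place
    perm = list(range(m))
    for k in range(n):
        # Partial pivoting: pick the largest-magnitude entry in column k,
        # searching rows k..m-1.
        p = max(range(k, m), key=lambda i: abs(w[i][k]))
        if w[p][k] == 0.0:
            raise ValueError("matrix is rank deficient")
        w[k], w[p] = w[p], w[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, m):
            w[i][k] /= w[k][k]  # multiplier, stored where L's entry lives
            for j in range(k + 1, n):
                w[i][j] -= w[i][k] * w[k][j]  # update the trailing block
    # Split the overwritten working array into its L and U parts.
    lower = [[1.0 if i == j else (w[i][j] if j < i else 0.0)
              for j in range(n)] for i in range(m)]
    upper = [[w[i][j] if j >= i else 0.0 for j in range(n)]
             for i in range(n)]
    return perm, lower, upper
```

In this sketch each of the n columns requires a separate pivot search over up to m rows. When the rows are distributed across processors, a search of this kind generally requires communication at every column, which is one reason the tall and skinny case is sensitive to the number of messages exchanged.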