Avoiding communication through a multilevel LU factorization

Authors:
Simplice Donfack;Laura Grigori;Amal Khabou
Affiliations:
INRIA Saclay-Ile de France, Laboratoire de Recherche en Informatique, Université Paris-Sud, France (all authors)
Venue:
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Year:
2012

Abstract

Massively parallel computers are evolving towards deeper levels of parallelism and memory hierarchy, and the ratio of the time required to transfer data, either through the memory hierarchy or between compute units, to the time required to perform floating-point operations is increasing exponentially. Algorithms therefore face two challenges: they must exploit multiple levels of parallelism, and they must reduce communication both between the compute units at each level of the hierarchy of parallelism and between the levels of the memory hierarchy. In this paper we present an algorithm for the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. The algorithm minimizes both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors, based on MPI and Pthreads, and show that it outperforms the LU factorization routines of well-known numerical libraries. For tall and skinny matrices, that is, matrices with many more rows than columns, our algorithm outperforms the corresponding ScaLAPACK algorithm by a factor of 4.5 on a cluster of 32 nodes, each with two quad-core Intel Xeon EM64T processors.
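The abstract does not say how pivots are chosen for a tall and skinny panel. One scheme known from earlier communication-avoiding LU work is tournament pivoting, and the sketch below illustrates it under that assumption. It is a single-process, single-level toy, not the authors' MPI/Pthreads implementation. The function names `gepp_pivot_rows` and `tournament_pivots`, the block count, and the sample panel are all hypothetical. Each block of rows first selects candidate pivot rows locally, and candidates are then merged pairwise, so a parallel version would need about log2(number of blocks) rounds of messages instead of one exchange per column.

```python
def gepp_pivot_rows(rows, block):
    """Return the ids of the pivot rows that Gaussian elimination with
    partial pivoting selects from `block`; `rows` holds the global row ids."""
    work = [[float(x) for x in r] for r in block]  # local copy, input untouched
    idx = list(rows)
    m, b = len(work), len(work[0])
    k_max = min(m, b)
    for k in range(k_max):
        # Partial pivoting: largest magnitude in column k among rows k..m-1.
        p = max(range(k, m), key=lambda i: abs(work[i][k]))
        work[k], work[p] = work[p], work[k]
        idx[k], idx[p] = idx[p], idx[k]
        if work[k][k] == 0.0:
            continue  # column is already zero below the diagonal
        for i in range(k + 1, m):
            f = work[i][k] / work[k][k]
            for j in range(k, b):
                work[i][j] -= f * work[k][j]
    return idx[:k_max]


def tournament_pivots(panel, nblocks):
    """Select pivot rows for a tall and skinny panel by a binary tournament.
    Returns (pivot row ids, number of reduction rounds)."""
    m = len(panel)
    size = -(-m // nblocks)  # ceiling division
    # Leaves: each block picks its candidates with no communication.
    candidates = []
    for start in range(0, m, size):
        rows = list(range(start, min(start + size, m)))
        candidates.append(gepp_pivot_rows(rows, [panel[i] for i in rows]))
    # Reduction tree: merge candidate sets pairwise until one set remains.
    rounds = 0
    while len(candidates) > 1:
        merged = []
        for i in range(0, len(candidates), 2):
            rows = list(candidates[i])
            if i + 1 < len(candidates):
                rows += candidates[i + 1]
            merged.append(gepp_pivot_rows(rows, [panel[r] for r in rows]))
        candidates = merged
        rounds += 1
    return candidates[0], rounds


# Hypothetical 8 x 2 panel split into 4 row blocks.
panel = [[1, 2], [3, 4], [5, 6], [7, 8], [2, 1], [9, 3], [4, 1], [6, 5]]
pivots, rounds = tournament_pivots(panel, nblocks=4)
print(pivots, rounds)  # prints [5, 3] 2
```

On this small panel the tournament happens to pick the same rows as partial pivoting applied to the whole panel. That agreement is not guaranteed in general.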