Applying recursion to serial and parallel QR factorization leads to better performance

  • Authors:
  • E. Elmroth;F. G. Gustavson

  • Affiliations:
  • Department of Computing Science and High Performance Computing Center North, Umeå University, Umeå, Sweden;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2000

Quantified Score

Hi-index 0.01

Visualization

Abstract

We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix. They improve performance. The recursion leads to an automatic variable blocking, and it also replaces a Level 2 part in a standard block algorithm with Level 3 operations. However, there are significant additional costs for creating and performing the updates, which prohibit the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000® SP nodes-a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC® 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation on a four-way 332-MHz IBM PPC604e SMP node based on dynamic load balancing is presented. For two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97.