Implementation and optimization of dense LU decomposition on the stream processor

  • Authors:
  • Ying Zhang;Tao Tang;Gen Li;Xuejun Yang

  • Affiliations:
  • School of Computer, National University of Defense Technology, Changsha, China;School of Computer, National University of Defense Technology, Changsha, China;School of Computer, National University of Defense Technology, Changsha, China;School of Computer, National University of Defense Technology, Changsha, China

  • Venue:
  • PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
  • Year:
  • 2007

Abstract

Developing scientific computing applications on the stream processor has attracted considerable research attention. In this paper, we implement and optimize dense LU decomposition on the stream processor. Unlike existing parallel algorithms for LU decomposition, our StreamLUD algorithm aims to exploit producer-consumer locality and to overlap off-chip memory access with kernel execution. Simulation results show that, across matrices of different sizes, the optimized StreamLUD achieves speedups of 2.56 to 3.64 over the LU decomposition (LUD) in HPL running on an Itanium 2 processor.
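The abstract does not describe StreamLUD's kernels, so the following is only a generic reference point: a minimal Python sketch of blocked right-looking LU decomposition, not the authors' algorithm. The names `blocked_lu`, `lu_product`, and the block-width parameter `b` are hypothetical. Each outer iteration produces a panel of L (L21) and a row block of U (U12) that the trailing update consumes immediately; that producer-consumer relationship is the kind of data reuse a stream implementation could try to keep on chip. The sketch does not model memory transfers or the overlap of off-chip access with kernel execution, and, unlike HPL, it performs no partial pivoting.

```python
def blocked_lu(a, b):
    """Blocked right-looking LU factorization without pivoting (in place).

    a: square matrix as a list of lists of floats; overwritten so that the
       strict lower triangle holds L (unit diagonal implied) and the upper
       triangle, including the diagonal, holds U.
    b: block (panel) width, b >= 1.
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)

        # 1. Panel factorization: unblocked LU of columns k0..k1-1, rows k0..n-1.
        for k in range(k0, k1):
            pivot = a[k][k]
            if pivot == 0.0:
                raise ZeroDivisionError("zero pivot: this sketch does no pivoting")
            for i in range(k + 1, n):
                a[i][k] /= pivot
                lik = a[i][k]
                for j in range(k + 1, k1):
                    a[i][j] -= lik * a[k][j]

        # 2. Row block of U: U12 = inv(L11) * A12 by forward substitution
        #    with the unit lower triangular L11 produced in step 1.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]

        # 3. Trailing update: A22 -= L21 * U12. This step consumes the
        #    panel (L21) and row block (U12) produced just above.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a


def lu_product(f):
    """Multiply the packed factors back together: returns L * U."""
    n = len(f)
    return [[sum((f[i][k] if k < i else 1.0) * f[k][j]
                 for k in range(min(i, j) + 1))
             for j in range(n)]
            for i in range(n)]


# Usage: a diagonally dominant matrix, so LU without pivoting is safe.
n = 6
A = [[1.0 / (i + j + 1) + (n if i == j else 0.0) for j in range(n)] for i in range(n)]
F = blocked_lu([row[:] for row in A], b=2)
err = max(abs(x - y) for r1, r2 in zip(lu_product(F), A) for x, y in zip(r1, r2))
print(err < 1e-12)
```

With `b = 1` the sketch reduces to unblocked right-looking elimination; larger `b` turns step 3 into a larger matrix-multiply-like update, which is where most of the arithmetic in a blocked LU is concentrated.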