Improving Performance of Matrix Multiplication and FFT on GPU

  • Authors:
  • Xiang Cui; Yifeng Chen; Hong Mei

  • Venue:
  • ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
  • Year:
  • 2009

Abstract

In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on the NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms.
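
For readers unfamiliar with the operation being benchmarked, the following is a minimal reference sketch of what SGEMM computes and of how a Gflops figure is conventionally derived from a timing. It is not the authors' GPU kernel: it is plain, unoptimized host C, restricted to the no-transpose case, and the names sgemm_ref and gemm_gflops are hypothetical helpers introduced here for illustration. The 2*m*n*k operation count is the usual convention for GEMM and is assumed, not stated in the abstract.

```c
#include <assert.h>
#include <stddef.h>

/* Reference SGEMM, no-transpose case only: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, stored column-major as in BLAS,
 * with leading dimensions lda, ldb, ldc.
 * Hypothetical helper for illustration; not the kernel from the paper. */
static void sgemm_ref(size_t m, size_t n, size_t k, float alpha,
                      const float *a, size_t lda,
                      const float *b, size_t ldb,
                      float beta, float *c, size_t ldc)
{
    /* Leading dimensions must cover the rows of each matrix. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

/* Conventional GEMM rate: one multiply and one add per inner-loop step,
 * i.e. 2*m*n*k floating-point operations, divided by elapsed time.
 * Hypothetical helper; the counting convention is an assumption. */
static double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

Under this convention each loaded element takes part in many multiply-add steps, which is consistent with the abstract's description of SGEMM as computation-intensive. A call such as gemm_gflops(m, n, k, elapsed_seconds) turns a measured run time into the kind of rate quoted above.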