Improving Performance of Matrix Multiplication and FFT on GPU

Authors:
Xiang Cui; Yifeng Chen; Hong Mei
Venue:
ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
Year:
2009

Citing 0
Cited 3

Large-scale FFT on GPU clusters
Proceedings of the 24th ACM International Conference on Supercomputing

GPU-based FFT computation for multi-gigabit wirelessHD baseband processing
EURASIP Journal on Wireless Communications and Networking

Efficient 3D stencil computations using CUDA
Parallel Computing

Abstract

In this paper we discuss our experience in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.
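
The contrast between a computation-intensive SGEMM and a bandwidth-intensive FFT can be illustrated with a rough arithmetic-intensity estimate (floating-point operations per byte of memory traffic). The sketch below is not taken from the paper. It assumes the usual operation counts (2·M·N·K for SGEMM, ignoring the alpha/beta scaling, and the conventional 5·N·log2(N) for a complex FFT of length N) and an idealized model in which each operand is moved to or from memory exactly once. All function names are hypothetical.

```python
import math


def sgemm_flops(m, n, k):
    # One multiply and one add per inner-product term of C = A * B
    # (alpha/beta scaling ignored; an assumption of this sketch).
    return 2 * m * n * k


def sgemm_min_bytes(m, n, k, elem_size=4):
    # Idealized traffic: read A (m*k), B (k*n) and C (m*n), then write C (m*n).
    # elem_size=4 bytes corresponds to single precision.
    return elem_size * (m * k + k * n + 2 * m * n)


def fft_flops(n):
    # Conventional operation count used when reporting FFT Gflops.
    return 5 * n * math.log2(n)


def fft_min_bytes(n, elem_size=8):
    # Idealized traffic: read and write n complex single-precision values once
    # (8 bytes per complex value).
    return 2 * n * elem_size


def intensity(flops, nbytes):
    # Arithmetic intensity in flops per byte.
    return flops / nbytes


if __name__ == "__main__":
    n = 4096
    print("SGEMM flops/byte:", intensity(sgemm_flops(n, n, n), sgemm_min_bytes(n, n, n)))
    print("FFT   flops/byte:", intensity(fft_flops(n), fft_min_bytes(n)))
```

Under these assumptions, the intensity of a square SGEMM grows linearly with the matrix dimension (n/8 flops per byte), whereas that of the FFT grows only logarithmically with its length, which is consistent with the abstract's description of the FFT as limited by memory bandwidth rather than arithmetic.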