In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is limited by memory bandwidth or communication. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are also obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.
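
The contrast between a computation-intensive SGEMM and a bandwidth-limited FFT can be illustrated with a rough arithmetic-intensity estimate (floating-point operations per byte of memory traffic). The sketch below is not taken from the paper. It assumes idealized traffic in which every operand is read or written exactly once, the textbook count of 2n^3 flops for an n x n matrix multiply, and the conventional 5n log2(n) flop estimate for a radix-2 complex FFT of n points. The function names are hypothetical.

```python
import math


def sgemm_intensity(n: int) -> float:
    """Flops per byte for C = A * B with n x n single-precision matrices.

    Assumption: A and B are each read once and C is written once.
    """
    flops = 2.0 * n**3           # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * 4  # three matrices, 4 bytes per float
    return flops / bytes_moved


def fft_intensity(n: int) -> float:
    """Flops per byte for a single-precision complex FFT of n points.

    Assumption: the data is read once and written once (a single pass),
    and the flop count is the conventional radix-2 estimate 5 n log2(n).
    """
    flops = 5.0 * n * math.log2(n)
    bytes_moved = 2 * n * 8      # read + write, 8 bytes per complex value
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (1024, 4096):
        print(f"n={n}: SGEMM {sgemm_intensity(n):.1f} flops/byte, "
              f"FFT {fft_intensity(n):.2f} flops/byte")
```

Under these assumptions the SGEMM intensity grows linearly with n (it simplifies to n/6), whereas the FFT intensity grows only logarithmically (5 log2(n)/16). That is consistent with the abstract's characterization of the FFT as the more bandwidth-sensitive of the two, although real kernels move more data than this ideal model suggests.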