Locality-improved FFT implementation on a graphics processor

Authors:
Sergio Romero;Maria A. Trenas;Eladio Gutierrez;Emilio L. Zapata
Affiliations:
Department of Computer Architecture, University of Málaga, Málaga, Spain;Department of Computer Architecture, University of Málaga, Málaga, Spain;Department of Computer Architecture, University of Málaga, Málaga, Spain;Department of Computer Architecture, University of Málaga, Málaga, Spain
Venue:
ISCGAV'07 Proceedings of the 7th WSEAS International Conference on Signal Processing, Computational Geometry & Artificial Vision
Year:
2007

Abstract

The growing computational power of modern graphics processing units (GPUs) makes them well suited to general-purpose computing. These commodity processors generally operate as parallel SIMD platforms and, among other factors, the effectiveness of a code depends on properly exploiting the underlying memory hierarchy. This paper deals with the implementation of the Fast Fourier Transform (FFT) on a novel graphics architecture recently introduced by NVIDIA. The implementation takes memory reference locality into account, which is crucial when pursuing a high degree of parallelism, that is, a good occupancy of the processing elements. The proposed implementation has been tested and compared with the manufacturer's own implementation.