FFTs in external of hierarchical memory
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The periodic balanced sorting network
Journal of the ACM (JACM)
Evaluating Associativity in CPU Caches
IEEE Transactions on Computers
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
LAPACK's user's guide
Compiler blockability of numerical algorithms
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Access normalization: loop restructuring for NUMA computers
ACM Transactions on Computer Systems (TOCS)
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data-centric multi-level blocking
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
The design and analysis of a cache architecture for texture mapping
Proceedings of the 24th annual international symposium on Computer architecture
The influence of caches on the performance of sorting
SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
Sorting on a mesh-connected parallel computer
Communications of the ACM
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Introduction to Algorithms
Fast matrix multiplies using graphics hardware
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Photon mapping on programmable graphics hardware
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
ACM SIGGRAPH 2004 Papers
UberFlow: a GPU-based particle engine
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Fast and approximate stream mining of quantiles and frequencies using graphics processors
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
GPUTeraSort: high performance graphics co-processor sorting for large database management
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Accelerator: using data parallelism to program GPUs for general-purpose uses
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A memory model for scientific algorithms on graphics processors
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
All-pairs shortest-paths for large graphs on the GPU
Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Journal of Parallel and Distributed Computing
Stream processing for fast and efficient rotated Haar-like features using rotated integral images
International Journal of Intelligent Systems Technologies and Applications
Fast and scalable list ranking on the GPU
Proceedings of the 23rd international conference on Supercomputing
ACM SIGGRAPH 2009 Courses
Energy-aware high performance computing with graphic processing units
HotPower'08 Proceedings of the 2008 conference on Power aware computing and systems
GPU-based FFT computation for multi-gigabit wirelessHD baseband processing
EURASIP Journal on Wireless Communications and Networking
Implementing p systems parallelism by means of GPUs
WMC'09 Proceedings of the 10th international conference on Membrane Computing
A direct method for optimal VLSI realization of deeply nested n-D loop problems
Microprocessors & Microsystems
Hi-index | 0.01 |
We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2-10x improvement in performance.