We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C's model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. To demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications: sorting, fast Fourier transform, and dense matrix multiplication. In practice, our cache-efficient algorithms for these applications achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors, and we are able to achieve a 2-5x performance improvement.
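As a rough illustration of the kind of nested-loop restructuring the abstract refers to, the sketch below shows generic loop blocking (tiling) for dense matrix multiplication. It is a plain CPU-side Python sketch and not the authors' GPU implementation: the function names `blocked_matmul` and `naive_matmul` and the `tile` parameter are hypothetical, and the tile size would in practice be chosen to match the cache and block dimensions of the target hardware.

```python
def naive_matmul(a, b, n):
    """Reference triple loop over two n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def blocked_matmul(a, b, n, tile):
    """Same product, computed one tile x tile block at a time.

    The three outer loops walk over 2D blocks; the three inner loops stay
    inside one block of each matrix, so the working set is roughly
    3 * tile * tile elements (assumed small enough to fit in cache).
    """
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # min(...) handles edge blocks when tile does not divide n.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both functions compute the same result; only the order in which elements are visited changes, which is what affects capacity and conflict misses in a cache model such as the 3C's.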