Using graphics hardware for general-purpose numerical computation has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of the input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs, but, surprisingly, we find that even near-optimal GPU implementations are markedly less efficient than current cache-aware CPU approaches. We find that the key cause of this inefficiency is that the GPU can fetch less data, and yet execute more arithmetic operations, per clock than the CPU when both are operating out of their closest caches. The lack of high-bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
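To make the reuse argument concrete, the following is a minimal sketch in Python and not the implementation analyzed in the paper. It multiplies two n x n matrices tile by tile and counts, under a deliberately simplified model, how many input elements must be brought into a fast cache versus how many arithmetic operations are performed on them. The function name `blocked_matmul`, the `block` parameter, and the accounting model (each tile of the inputs is fetched exactly once per tile product, and traffic for the output matrix is ignored) are assumptions made for illustration only.

```python
def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) one tile at a time.

    Returns the product together with two counters from a simplified
    cost model (an assumption of this sketch, not a measurement):
      flops   - multiply and add operations performed
      fetched - input elements brought into the fast cache, assuming each
                tile of A and B is fetched once per tile product
    """
    c = [[0.0] * n for _ in range(n)]
    flops = 0
    fetched = 0
    for i0 in range(0, n, block):
        i_max = min(i0 + block, n)
        for j0 in range(0, n, block):
            j_max = min(j0 + block, n)
            for k0 in range(0, n, block):
                k_max = min(k0 + block, n)
                # One tile of A and one tile of B enter the cache here.
                fetched += (i_max - i0) * (k_max - k0)
                fetched += (k_max - k0) * (j_max - j0)
                # Every cached element is then reused across the whole tile.
                for i in range(i0, i_max):
                    for j in range(j0, j_max):
                        acc = c[i][j]
                        for k in range(k0, k_max):
                            acc += a[i][k] * b[k][j]
                            flops += 2  # one multiply, one add
                        c[i][j] = acc
    return c, flops, fetched


if __name__ == "__main__":
    n = 8
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    for block in (1, 2, 4, 8):
        _, flops, fetched = blocked_matmul(a, b, n, block)
        print(block, flops, fetched, flops / fetched)
```

In this model the total arithmetic stays fixed while the number of fetched elements shrinks as the tile grows, so the ratio of operations to fetched elements scales with the tile edge. That ratio is only achievable in practice if the cache holding the tiles can supply operands as fast as the arithmetic units consume them, which is the bandwidth limitation the abstract identifies.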