Vector and parallel algorithms for Cholesky factorization on IBM 3090
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
LAPACK: a portable linear algebra library for high-performance computers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
ICS '90 Proceedings of the 4th international conference on Supercomputing
Numerical Linear Algebra for High Performance Computers
Numerical Linear Algebra for High Performance Computers
An Adaptive Blocking Strategy for Matrix Factorizations
CONPAR 90/VAPP IV Proceedings of the Joint International Conference on Vector and Parallel Processing
LAPACK Working Note 24: LAPACK Block Factorization Algorithms on the INtel iPSC/860
LAPACK Working Note 24: LAPACK Block Factorization Algorithms on the INtel iPSC/860
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A memory model for scientific algorithms on graphics processors
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
The Cray BlackWidow: a highly scalable vector multiprocessor
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Efficient gather and scatter operations on graphics processors
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Communication avoiding Gaussian elimination
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Solving dense linear systems on platforms with multiple hardware accelerators
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Aspects of GPU for general purpose high performance computing
Proceedings of the 2009 Asia and South Pacific Design Automation Conference
Performance analysis of accelerated image registration using GPGPU
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Computer Speech and Language
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
Proceedings of the 23rd international conference on Supercomputing
Single-particle 3d reconstruction from cryo-electron microscopy images on GPU
Proceedings of the 23rd international conference on Supercomputing
A Note on Auto-tuning GEMM for GPUs
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Performance Optimization Strategies of High Performance Computing on GPU
APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
GPU based sparse grid technique for solving multidimensional options pricing PDEs
Proceedings of the 2nd Workshop on High Performance Computational Finance
Triangular matrix inversion on Graphics Processing Unit
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Auto-tuning 3-D FFT library for CUDA GPUs
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Parallel implementation of a financial application on a GPU
Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Fast tridiagonal solvers on the GPU
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
A design case study: CPU vs. GPGPU vs. FPGA
MEMOCODE'09 Proceedings of the 7th IEEE/ACM international conference on Formal Methods and Models for Codesign
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
The Scalable Heterogeneous Computing (SHOC) benchmark suite
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Integrating neuromuscular and cyber systems for neural control of artificial legs
Proceedings of the 1st ACM/IEEE International Conference on Cyber-Physical Systems
State-of-the-art in heterogeneous computing
Scientific Programming
Towards dense linear algebra for hybrid GPU accelerated manycore systems
Parallel Computing
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 24th ACM International Conference on Supercomputing
Large-scale FFT on GPU clusters
Proceedings of the 24th ACM International Conference on Supercomputing
Proceedings of the 24th ACM International Conference on Supercomputing
Parallel implementation of Artificial Neural Network training for speech recognition
Pattern Recognition Letters
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster
Journal of Computational Physics
Reconfigurable real-time MIMO detector on GPU
Asilomar'09 Proceedings of the 43rd Asilomar conference on Signals, systems and computers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
An efficient implementation of GPU virtualization in high performance clusters
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Source-to-source optimization of CUDA C for GPU accelerated cardiac cell modeling
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
An Improved Magma Gemm For Fermi Graphics Processing Units
International Journal of High Performance Computing Applications
Compact data structure and scalable algorithms for the sparse grid technique
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Achieving a single compute device image in OpenCL for multiple GPUs
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Computing on multi-core platform: performance issues
Proceedings of the 2011 International Conference on Communication, Computing & Security
Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Monte Carlo methods: a computational pattern for our pattern language
Proceedings of the 2010 Workshop on Parallel Programming Patterns
Register packing for cyclic reduction: a case study
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A fast GEMM implementation on the cypress GPU
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Accelerating GPU kernels for dense linear algebra
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
A scalable high performant Cholesky factorization for multicore with GPU accelerators
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Improving accuracy for matrix multiplications on GPUs
Scientific Programming
International Journal of High Performance Computing Applications
Advances in Engineering Software
Graph expansion and communication costs of fast matrix multiplication: regular submission
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Mint: realizing CUDA performance in 3D stencil methods with annotated C
Proceedings of the international conference on Supercomputing
Designing and dynamically load balancing hybrid LU for multi/many-core
Computer Science - Research and Development
Optimized HPL for AMD GPU and multi-core CPU usage
Computer Science - Research and Development
Bounding the effect of partition camping in GPU kernels
Proceedings of the 8th ACM International Conference on Computing Frontiers
Journal of Computational and Applied Mathematics
kNN query processing in metric spaces using GPUs
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
A fully empirical autotuned dense QR factorization for multicore architectures
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Model-driven tile size selection for DOACROSS loops on GPUs
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Automatic OpenCL device characterization: guiding optimized kernel design
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Optimizing symmetric dense matrix-vector multiplication on GPUs
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Fast implementation of DGEMM on Fermi GPU
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
A CPU-GPU hybrid approach for the unsymmetric multifrontal method
Parallel Computing
Efficient Parallel Nonnegative Least Squares on Multicore Architectures
SIAM Journal on Scientific Computing
GPU-based single-cluster algorithm for the simulation of the Ising model
Journal of Computational Physics
Portable and scalable FPGA-based acceleration of a direct linear system solver
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Optimizing sweep3d for graphic processor unit
ICA3PP'10 Proceedings of the 10th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Reducing off-chip memory traffic by selective cache management scheme in GPGPUs
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Automatically tuning sparse matrix-vector multiplication for GPU architectures
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Toward techniques for auto-tuning GPU algorithms
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Extendable pattern-oriented optimization directives
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
A unified optimizing compiler framework for different GPGPU architectures
ACM Transactions on Architecture and Code Optimization (TACO)
GPU programming in a high level language: compiling X10 to CUDA
Proceedings of the 2011 ACM SIGPLAN X10 Workshop
The tradeoffs of fused memory hierarchies in heterogeneous computing architectures
Proceedings of the 9th conference on Computing Frontiers
Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
Spherical harmonic transform with GPUs
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Design patterns for scientific computations on sparse matrices
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Extending a highly parallel data mining algorithm to the intel ® many integrated core architecture
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Dynamic compilation of data-parallel kernels for vector processors
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Automatic restructuring of GPU kernels for exploiting inter-thread data locality
CC'12 Proceedings of the 21st international conference on Compiler Construction
Interference-driven resource management for GPU-based heterogeneous clusters
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs
Proceedings of the 26th ACM international conference on Supercomputing
Journal of Computational Physics
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Acoustic scattering solver based on single level FMM for multi-GPU systems
Journal of Parallel and Distributed Computing
International Journal of High Performance Computing Applications
Extendable pattern-oriented optimization directives
ACM Transactions on Architecture and Code Optimization (TACO)
Shared memory multiplexing: a novel way to improve GPGPU throughput
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Invariants of distance k-graphs for graph embedding
Pattern Recognition Letters
Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Graph expansion and communication costs of fast matrix multiplication
Journal of the ACM (JACM)
A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Layout-oblivious compiler optimization for matrix computations
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Parallel Shellsort Algorithm for Many-Core GPUs with CUDA
International Journal of Grid and High Performance Computing
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs
International Journal of Computational Science and Engineering
Systematic approach in optimizing numerical memory-bound kernels on GPU
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Portable performance on heterogeneous architectures
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Optimizing tensor contraction expressions for hybrid CPU-GPU execution
Cluster Computing
GPU implementation of a novel hybrid lattice Boltzmann method for non-isothermal flows
Proceedings of the 5th ACM COMPUTE Conference: Intelligent & scalable system technologies
Proceedings of the 27th international ACM conference on International conference on supercomputing
SemCache: semantics-aware caching for efficient GPU offloading
Proceedings of the 27th international ACM conference on International conference on supercomputing
Complexity of the path avoiding forbidden pairs problem revisited
Discrete Applied Mathematics
A large-scale cross-architecture evaluation of thread-coarsening
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Assessing the performance of OpenMP programs on the intel xeon phi
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Efficient 3D stencil computations using CUDA
Parallel Computing
Vectorized OpenCL implementation of numerical integration for higher order finite elements
Computers & Mathematics with Applications
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
In-place transposition of rectangular matrices on accelerators
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems
HARP: Harnessing inactive threads in many-core processors
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
International Journal of Computational Science and Engineering
Proceedings of the 5th ACM/SPEC international conference on Performance engineering
Toward GPU accelerated topology optimization on unstructured meshes
Structural and Multidisciplinary Optimization
Journal of Real-Time Image Processing
Design patterns for sparse-matrix computations on hybrid CPU/GPU platforms
Scientific Programming
Numerical integration on GPUs for higher order finite elements
Computers & Mathematics with Applications
A low-cost 3D human interface device using GPU-based optical flow algorithms
Integrated Computer-Aided Engineering
Hi-index | 0.01 |
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80--90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.