Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Authors:
Wenjing Ma;Sriram Krishnamoorthy;Oreste Villa;Karol Kowalski;Gagan Agrawal
Affiliations:
Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, USA;Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, USA;Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, USA;Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, Richland, USA and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richla ...;Department of Computer Sciences and Engineering, The Ohio State University, Columbus, USA
Venue:
Cluster Computing
Year:
2013

Abstract

Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Executing them efficiently on Graphics Processing Units (GPUs) requires addressing several challenges, including index permutation and small dimension sizes that reduce thread block utilization. Moreover, applying the same optimizations to a variety of expressions requires a code generation tool. In this paper, we present our approach to automatically generating CUDA code that executes tensor contractions on GPUs, including management of data movement between the CPU and the GPU. To evaluate our tool, we generate GPU-enabled code for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporate it into NWChem, a popular computational chemistry suite. For this method, we demonstrate a speedup of more than a factor of 8.4 using one GPU compared with one CPU core, and of more than a factor of 2.6 when the entire system is used through a hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). We further investigate tensor contraction code on a new series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same CCSD(T) computation, a cluster with Fermi GPUs achieves a speedup of 3.4 over a cluster with T10 GPUs. With a single Fermi GPU on each node, we achieve a speedup of 43 over the sequential CPU version.
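
To make the phrase "generalized multidimensional matrix multiplication" concrete, the following is a minimal sketch in plain Python, not the CUDA code the paper's tool generates. It assumes one illustrative contraction, C[a][b][c] = sum over k of A[a][k][c] * B[k][b], with hypothetical function names and tensor shapes chosen only for this example. It shows why index permutation matters: once the indices of A are reordered and flattened, the contraction reduces to an ordinary matrix product, whose result is then permuted back.

```python
from itertools import product


def contract_direct(A, B, na, nb, nc, nk):
    """Reference version: C[a][b][c] = sum_k A[a][k][c] * B[k][b], with explicit loops."""
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for a, b, c, k in product(range(na), range(nb), range(nc), range(nk)):
        C[a][b][c] += A[a][k][c] * B[k][b]
    return C


def contract_via_matmul(A, B, na, nb, nc, nk):
    """Same contraction expressed as permute -> matrix multiply -> permute."""
    # Step 1: permute A[a][k][c] into a 2-D matrix A2[(a, c)][k].
    # The pair (a, c) is flattened into the single row index a * nc + c.
    A2 = [[A[a][k][c] for k in range(nk)]
          for a in range(na) for c in range(nc)]
    # Step 2: ordinary matrix product M[(a, c)][b] = sum_k A2[(a, c)][k] * B[k][b].
    M = [[sum(A2[r][k] * B[k][b] for k in range(nk)) for b in range(nb)]
         for r in range(na * nc)]
    # Step 3: permute the result back into the requested layout C[a][b][c].
    return [[[M[a * nc + c][b] for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


# Small demonstration with integer data (shapes are arbitrary illustrative choices).
na, nb, nc, nk = 2, 3, 4, 5
A = [[[a * 100 + k * 10 + c for c in range(nc)] for k in range(nk)] for a in range(na)]
B = [[k - b for b in range(nb)] for k in range(nk)]
print(contract_direct(A, B, na, nb, nc, nk) == contract_via_matmul(A, B, na, nb, nc, nk))
```

The two permutation steps move data without doing any arithmetic, so in a real implementation their cost competes with the multiply itself when dimensions are small. That is consistent with the abstract listing index permutation and small dimension sizes among the challenges.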