Accuracy and performance of graphics processors: A Quantum Monte Carlo application case study

Authors:
Jeremy S. Meredith;Gonzalo Alvarez;Thomas A. Maier;Thomas C. Schulthess;Jeffrey S. Vetter
Affiliations:
Oak Ridge National Laboratory, 1 Bethel Valley Road, MS 6173, Oak Ridge, TN 37831, USA (all authors)
Venue:
Parallel Computing
Year:
2009

Abstract

The tradeoff between accuracy and performance remains an unsolved problem when Graphics Processing Units (GPUs) are used as general-purpose computation devices. Their high performance and low cost make them a desirable target for scientific computation, and new language efforts help address the programming challenges of data-parallel algorithms and memory management. But the original task of GPUs, real-time rendering, has traditionally kept accuracy as a secondary goal, and sacrifices have sometimes been made as a result. In fact, the widely deployed hardware is generally capable of only single-precision arithmetic, and even this accuracy is not necessarily equivalent to that of a commodity CPU. In this paper, we investigate the accuracy and performance characteristics of GPUs, including results from a preproduction double-precision-capable GPU. We then accelerate the full Quantum Monte Carlo simulation code DCA++, similarly investigating its tolerance to the precision of arithmetic delivered by GPUs. The results show that while DCA++ has some sensitivity to arithmetic precision, the single-precision GPU results were comparable to single-precision CPU results. Acceleration of the code on a fully GPU-enabled cluster showed that any remaining inaccuracy in GPU precision was negligible; sufficient accuracy was retained for scientifically meaningful results while still showing significant speedups.
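
To make the precision concern concrete, the following is a minimal sketch of how rounding error can accumulate differently in single and double precision. It is not taken from the paper or from DCA++: the helper names `to_f32` and `accumulate`, the input value 0.1, and the count of 100,000 terms are all illustrative assumptions, and single precision is emulated on the CPU by rounding every intermediate result to a 32-bit float.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single-precision value.

    Hypothetical helper for this sketch; it emulates 32-bit storage on the CPU.
    """
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, single: bool) -> float:
    """Sum values sequentially, optionally rounding each partial sum to single precision."""
    total = 0.0
    for v in values:
        if single:
            # Round both the operand and the running sum to 32 bits.
            total = to_f32(total + to_f32(v))
        else:
            total += v
    return total


# Illustrative workload (an assumption, not data from the paper).
values = [0.1] * 100_000
exact = 10_000.0

err_single = abs(accumulate(values, single=True) - exact)
err_double = abs(accumulate(values, single=False) - exact)

print(f"single-precision error: {err_single:.3e}")
print(f"double-precision error: {err_double:.3e}")
```

Under these assumptions the single-precision sum drifts from the exact value by many orders of magnitude more than the double-precision sum. Whether an error of that kind matters depends on the application, which is the question the abstract poses for DCA++.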