Efficient computation of sum-products on GPUs through software-managed cache

Authors:
Mark Silberstein;Assaf Schuster;Dan Geiger;Anjul Patney;John D. Owens
Affiliations:
Technion - Israel Institute of Technology, Haifa, Israel;Technion - Israel Institute of Technology, Haifa, Israel;Technion - Israel Institute of Technology, Haifa, Israel;University of California, Davis, CA, USA;University of California, Davis, CA, USA
Venue:
Proceedings of the 22nd annual international conference on Supercomputing
Year:
2008

Citing 9
Cited 21

Application-specific memory management for embedded systems using software-controlled caches

Proceedings of the 37th Annual Design Automation Conference
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Optimizing Compiler for the CELL Processor

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Implicit and explicit optimizations for stencil computations

Proceedings of the 2006 workshop on Memory system performance and correctness
A performance-oriented data parallel virtual machine for GPUs

ACM SIGGRAPH 2006 Sketches
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A memory model for scientific algorithms on graphics processors

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
A new look at the generalized distributive law

IEEE Transactions on Information Theory

Architecture-aware optimization targeting multithreaded stream computing

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Using many-core hardware to correlate radio astronomy signals

Proceedings of the 23rd international conference on Supercomputing
GridBot: execution of bags of tasks in multiple grids

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
The virtual marathon: parallel computing supports crowd simulations

IEEE Computer Graphics and Applications - Special issue on non-photorealistic rendering a virtual environment for teaching social skills
Multi GPU implementation of iterative tomographic reconstruction algorithms

ISBI'09 Proceedings of the Sixth IEEE international conference on Symposium on Biomedical Imaging: From Nano to Macro
High-throughput Bayesian computing machine with reconfigurable hardware

Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays
3D GPU architecture using cache stacking: performance, cost, power and thermal analysis

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

Proceedings of the 24th ACM International Conference on Supercomputing
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
A multiresolution approach to iterative reconstruction algorithms in x-ray computed tomography

IEEE Transactions on Image Processing
Source-to-source optimization of CUDA C for GPU accelerated cardiac cell modeling

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Simple optimizations for an applicative array language for graphics processors

Proceedings of the sixth workshop on Declarative aspects of multicore programming
An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures

Proceedings of the 4th Annual International Conference on Systems and Storage
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain

Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Technical Section: Realistic modeling of spectator behavior for soccer videogames with CUDA

Computers and Graphics
Safe and familiar multi-core programming by means of a hybrid functional and imperative language

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Massive Parallelization of Serial Inference Algorithms for a Complex Generalized Linear Model

ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special Issue on Monte Carlo Methods in Statistics
Optimizing parallel belief propagation in junction trees using regression

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
GPU code generation for ODE-based applications with phased shared-data access patterns

ACM Transactions on Architecture and Code Optimization (TACO)
GPU accelerated MCMC for modeling terrorist activity

Computational Statistics & Data Analysis

Abstract

We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach uses this memory efficiently by implementing a software-managed cache in it. We also present an analytical model for analyzing the performance of such algorithms. We apply the technique to a GPU-based solver for the sum-product problem, also known as marginalize a product of functions (MPF), which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to a 2700-fold speedup on random data and a 270-fold speedup on real-life genetic analysis datasets, measured on an NVIDIA GeForce 8800 GTX GPU against an optimized CPU version running on a 2.4 GHz Intel Core 2 with a 4 MB L2 cache.
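
To make the MPF problem concrete, here is a minimal brute-force sketch in Python. It is not the paper's GPU solver and does not model the software-managed cache. The function name `mpf`, the table layout, and the example values are assumptions chosen for illustration. The sketch multiplies the input functions and sums out every variable not listed in `keep`. With two functions f(a, x) and g(x, b), summing out x reduces to an ordinary matrix product, which is the similarity the abstract points to.

```python
from itertools import product


def mpf(factors, domains, keep):
    """Marginalize a product of functions by brute-force enumeration.

    factors: list of (scope, table) pairs; scope is a tuple of variable
             names and table maps a tuple of values (in scope order) to
             a number.
    domains: dict mapping each variable name to its number of values.
    keep:    tuple of variables that remain after summing out the rest.
    """
    variables = sorted(domains)
    result = {}
    # Visit every joint assignment of all variables.
    for values in product(*(range(domains[v]) for v in variables)):
        assignment = dict(zip(variables, values))
        term = 1.0
        for scope, table in factors:
            term *= table[tuple(assignment[v] for v in scope)]
        # Accumulate the product into the entry for the kept variables.
        key = tuple(assignment[v] for v in keep)
        result[key] = result.get(key, 0.0) + term
    return result


# Hypothetical example: f(a, x) and g(x, b) stored as 2x2 tables.
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
g = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
domains = {"a": 2, "x": 2, "b": 2}

# h(a, b) = sum over x of f(a, x) * g(x, b)
h = mpf([(("a", "x"), f), (("x", "b"), g)], domains, keep=("a", "b"))
print(h)
```

The enumeration revisits each table entry many times, which illustrates the high data reuse and low compute-to-memory access ratio that the abstract mentions: each step performs one multiplication per table lookup.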