We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach relies on using this memory efficiently by implementing a software-managed cache in it. We also present an analytical model for analyzing the performance of such algorithms. We apply the technique to a GPU-based solver for the sum-product problem, also known as marginalize a product of functions (MPF), which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation for accelerating MPF originated in the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing a chain matrix product of multi-dimensional matrices, but is more difficult because of its complex data-dependent access pattern, high data reuse, and low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to a 2700-fold speedup on random data and a 270-fold speedup on real-life genetic analysis datasets on an NVIDIA GeForce 8800 GTX GPU, compared with an optimized CPU version running on a 2.4 GHz Intel Core 2 with a 4 MB L2 cache.
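To make the analogy with matrix products concrete, the following is a minimal Python sketch of the smallest MPF instance: two functions f(a, b) and g(b, c), with the shared variable b summed out. It is an illustration only and not the solver described above. The function names, the `tile` parameter, and the `cached_g` buffer are hypothetical; the buffer merely stands in for software-managed memory by copying one tile of g explicitly and reusing it for every value of a.

```python
from itertools import product


def mpf_naive(f, g, na, nb, nc):
    """Marginalize b out of f[a][b] * g[b][c] for every pair (a, c)."""
    out = [[0.0] * nc for _ in range(na)]
    for a, c in product(range(na), range(nc)):
        out[a][c] = sum(f[a][b] * g[b][c] for b in range(nb))
    return out


def mpf_tiled(f, g, na, nb, nc, tile=2):
    """Same result, computed tile by tile over b.

    Each tile of g is copied once into a local buffer (an assumed
    stand-in for software-managed memory) and reused for all rows a.
    """
    out = [[0.0] * nc for _ in range(na)]
    for b0 in range(0, nb, tile):
        b1 = min(b0 + tile, nb)
        cached_g = [row[:] for row in g[b0:b1]]  # explicit cache fill
        for a in range(na):
            for c in range(nc):
                acc = 0.0
                for k, b in enumerate(range(b0, b1)):
                    acc += f[a][b] * cached_g[k][c]
                out[a][c] += acc  # accumulate partial sums across tiles
    return out


# Small made-up tables: f is 2x3 (a by b), g is 3x2 (b by c).
f = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
g = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

result_naive = mpf_naive(f, g, 2, 3, 2)
result_tiled = mpf_tiled(f, g, 2, 3, 2)
print(result_naive)  # [[7.0, 8.0], [16.0, 17.0]]
print(result_tiled == result_naive)  # True
```

With only two functions this reduces to an ordinary matrix product. The harder cases mentioned in the abstract involve more functions with multi-dimensional, data-dependent access patterns, which this sketch does not attempt to model.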