Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning

  • Authors:
  • E. Wes Bethel; Mark Howison

  • Affiliations:
  • Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Center for Computation and Visualization, Brown University, Providence, RI, USA

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2012


Abstract

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.
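The abstract itself contains no code. As a rough illustration of the kind of tunable parameters and optimizations it refers to (image tile size as the unit of parallel work, ray sampling step, and early ray termination in a raycasting volume renderer), the sketch below shows a minimal shared-memory CPU raycaster using OpenMP. It is not the paper's implementation: the synthetic volume, orthographic camera, toy transfer function, and all parameter values (TILE_W, TILE_H, STEP, ERT_ALPHA) are hypothetical placeholders chosen only to make the example self-contained.

```cpp
// Minimal raycasting volume rendering sketch (illustrative only, not the paper's code).
// Tunables of the kind the study varies: image tile size (work granularity per thread),
// ray sampling step, and the early-ray-termination opacity threshold.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

constexpr int   VOL       = 128;   // synthetic volume of VOL^3 voxels
constexpr int   IMG       = 512;   // output image resolution
constexpr int   TILE_W    = 32;    // tunable: image tile width
constexpr int   TILE_H    = 32;    // tunable: image tile height
constexpr float STEP      = 0.5f;  // tunable: ray sampling step in voxel units
constexpr float ERT_ALPHA = 0.98f; // tunable: early-ray-termination threshold

// Nearest-neighbor sample of a row-major (x fastest) volume; clamps to bounds.
inline float sample(const std::vector<float>& v, float x, float y, float z) {
    auto c = [](float t) { return std::min(std::max(int(t), 0), VOL - 1); };
    return v[(size_t(c(z)) * VOL + c(y)) * VOL + c(x)];
}

int main() {
    // Synthetic volume: radial falloff so rays accumulate a nonzero result.
    std::vector<float> vol(size_t(VOL) * VOL * VOL);
    for (int z = 0; z < VOL; ++z)
        for (int y = 0; y < VOL; ++y)
            for (int x = 0; x < VOL; ++x) {
                float dx = x - VOL / 2.f, dy = y - VOL / 2.f, dz = z - VOL / 2.f;
                vol[(size_t(z) * VOL + y) * VOL + x] =
                    std::max(0.f, 1.f - std::sqrt(dx*dx + dy*dy + dz*dz) / (VOL / 2.f));
            }

    std::vector<float> image(size_t(IMG) * IMG, 0.f);
    const int tilesX = IMG / TILE_W, tilesY = IMG / TILE_H;

    // Parallelize over image tiles; neighboring rays in a tile touch nearby voxels,
    // so tile size affects cache behavior as well as load balance.
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
            for (int py = ty * TILE_H; py < (ty + 1) * TILE_H; ++py)
                for (int px = tx * TILE_W; px < (tx + 1) * TILE_W; ++px) {
                    // Orthographic ray marching along +z through pixel (px, py).
                    float rx = px * float(VOL) / IMG, ry = py * float(VOL) / IMG;
                    float color = 0.f, alpha = 0.f;
                    for (float rz = 0.f; rz < VOL; rz += STEP) {
                        float s = sample(vol, rx, ry, rz);   // scalar value
                        float a = s * 0.05f;                 // toy transfer function
                        color += (1.f - alpha) * a * s;      // front-to-back compositing
                        alpha += (1.f - alpha) * a;
                        if (alpha > ERT_ALPHA) break;        // early ray termination
                    }
                    image[size_t(py) * IMG + px] = color;
                }

    std::printf("center pixel = %f\n", image[size_t(IMG / 2) * IMG + IMG / 2]);
    return 0;
}
```

In an auto-tuning setting of the sort the abstract advocates, a driver would sweep parameters such as TILE_W, TILE_H, and STEP across a platform-specific search space and select the configuration with the best measured runtime, since the optimum differs across CPUs and GPUs and is hard to predict analytically for view-dependent, unstructured memory access patterns.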