A Compute Unified System Architecture for Graphics Clusters Incorporating Data Locality

Authors:
Christoph Müller;Steffen Frey;Magnus Strengert;Carsten Dachsbacher;Thomas Ertl
Affiliations:
Visualisierungsinstitut der Universität Stuttgart;Visualisierungsinstitut der Universität Stuttgart;Visualisierungsinstitut der Universität Stuttgart;Visualisierungsinstitut der Universität Stuttgart;Visualisierungsinstitut der Universität Stuttgart
Venue:
IEEE Transactions on Visualization and Computer Graphics
Year:
2009

Cited by (4):
Parallelized computation for computer simulation of electrocardiograms using personal computers with multi-core CPU and general-purpose GPU. Computer Methods and Programs in Biomedicine
PaTraCo: a framework enabling the transparent and efficient programming of heterogeneous compute networks. EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Load balancing utilizing data redundancy in distributed volume rendering. EG PGV'11 Proceedings of the 11th Eurographics conference on Parallel Graphics and Visualization
Load balancing in a changing world: dealing with heterogeneity and performance variability. Proceedings of the ACM International Conference on Computing Frontiers


Visualization

Abstract

We present a development environment for distributed GPU computing that targets both multi-GPU systems and graphics clusters. Our system is based on CUDA and logically extends its parallel programming model for graphics processors to higher levels of parallelism, namely the PCI bus and network interconnects. The extended API mimics the full function set of current graphics hardware, including the concept of global memory, on all distribution layers, while the underlying communication mechanisms are handled transparently for the application developer. To allow for high scalability, in particular in network-interconnected environments, we introduce an automatic GPU-accelerated scheduling mechanism that is aware of data locality. This substantially reduces the overall amount of transmitted data, which leads to better GPU utilization and faster execution. We evaluate the performance and scalability of our system for bus-level and, especially, network-level parallelism on typical multi-GPU systems and graphics clusters.
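The idea of scheduling with awareness of data locality can be illustrated with a small sketch. The abstract does not describe the actual scheduling algorithm, so the following is not the authors' method: it is a simple greedy heuristic that runs on the CPU, and every name in it (`Node`, `transfer_cost`, `schedule`, `round_robin`, `load_penalty`) is a hypothetical chosen for illustration. Each task is placed on the node that would need the fewest additional bytes transferred, with an assumed penalty for nodes that already have many tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One GPU or cluster node (hypothetical model, not the paper's API)."""
    name: str
    resident: set = field(default_factory=set)  # ids of data chunks already held locally
    load: int = 0                               # number of tasks assigned so far


def transfer_cost(chunks, node, chunk_bytes):
    """Bytes that must be sent to `node` before a task using `chunks` can run."""
    return sum(chunk_bytes[c] for c in chunks if c not in node.resident)


def schedule(tasks, nodes, chunk_bytes, load_penalty=0):
    """Greedy locality-aware placement (assumed heuristic).

    cost(node) = bytes to transfer + load_penalty * tasks already on the node
    Returns (assignment, total bytes transferred).
    """
    assignment, total = {}, 0
    for task_id, chunks in tasks.items():
        best = min(
            nodes,
            key=lambda n: transfer_cost(chunks, n, chunk_bytes) + load_penalty * n.load,
        )
        total += transfer_cost(chunks, best, chunk_bytes)
        best.resident.update(chunks)  # transferred chunks stay cached on that node
        best.load += 1
        assignment[task_id] = best.name
    return assignment, total


def round_robin(tasks, nodes, chunk_bytes):
    """Baseline that ignores locality: task i goes to node i mod len(nodes)."""
    total = 0
    for i, chunks in enumerate(tasks.values()):
        node = nodes[i % len(nodes)]
        total += transfer_cost(chunks, node, chunk_bytes)
        node.resident.update(chunks)
    return total


def fresh_nodes():
    # Two nodes, each already holding half of the data set.
    return [Node("gpu0", {"a", "b"}), Node("gpu1", {"c", "d"})]


chunk_bytes = {"a": 100, "b": 100, "c": 100, "d": 100}
tasks = {"t0": ["a", "b"], "t1": ["a", "b"], "t2": ["c", "d"], "t3": ["c", "d"]}

assignment, moved = schedule(tasks, fresh_nodes(), chunk_bytes, load_penalty=50)
baseline_moved = round_robin(tasks, fresh_nodes(), chunk_bytes)
print(assignment, moved, baseline_moved)
```

In this toy setup the locality-aware placement moves no data, while the round-robin baseline moves 400 bytes because two tasks land on nodes that do not hold their input. A real system would also have to weigh compute load and link bandwidth, which the single `load_penalty` term only approximates.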