Performance of CUDA Virtualized Remote GPUs in High Performance Clusters

Authors:
José Duato; Antonio J. Peña; Federico Silla; Rafael Mayo; Enrique S. Quintana-Ortí
Venue:
ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
Year:
2011

Abstract

In previous work we presented the architecture of rCUDA, a middleware that enables CUDA remoting over a commodity network. That is, the middleware allows an application to use a CUDA-compatible graphics processing unit (GPU) installed in a remote computer as if it were installed in the computer where the application is running. This approach is based on the observation that GPUs in a cluster are usually not fully utilized, and it is intended to reduce the number of GPUs in the cluster, thus lowering acquisition and maintenance costs while keeping performance close to that of the fully equipped configuration. In this paper we model rCUDA over a series of high-throughput networks in order to assess how the performance of the underlying network influences the performance of our virtualization technique. For this purpose, we analyze the traces of two case studies over two different networks. Using these data, we calculate the expected performance of the same case studies over a series of high-throughput networks, in order to characterize the expected behavior of our solution in high performance clusters. The estimates are validated using real 1 Gbps Ethernet and 40 Gbps InfiniBand networks, showing an error on the order of 1% for executions involving data transfers above 40 MB. In summary, although our virtualization technique noticeably increases execution time over a 1 Gbps Ethernet network, it performs almost as efficiently as a local GPU when higher-performance interconnects are used. Therefore, the small overhead that our proposal incurs from the remote use of GPUs is worth the savings provided by a cluster configuration with fewer GPUs than nodes.
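The idea of projecting a measured run onto a different interconnect can be illustrated with a very simple sketch. This is not the model used in the paper, which the abstract does not spell out. It assumes that the network share of a run is just payload size divided by nominal link bandwidth, and that everything else (GPU compute, host work) stays constant across networks. The function names and the 1.0 s measured time are hypothetical, chosen only for illustration; the 40 MB payload and the 1 Gbps and 40 Gbps link speeds echo the figures in the abstract.

```python
def transfer_time_s(bytes_moved: float, bandwidth_gbps: float) -> float:
    """Seconds to move a payload over a link at its nominal bandwidth.

    Assumption: latency and protocol overhead are ignored.
    """
    bandwidth_bytes_per_s = bandwidth_gbps * 1e9 / 8  # Gbit/s -> bytes/s
    return bytes_moved / bandwidth_bytes_per_s


def estimate_remote_time_s(
    measured_total_s: float,
    bytes_moved: float,
    measured_gbps: float,
    target_gbps: float,
) -> float:
    """Replace the network share of a measured run with the share
    expected on a target network, keeping the rest unchanged.

    Assumption: non-network time does not depend on the interconnect.
    """
    non_network_s = measured_total_s - transfer_time_s(bytes_moved, measured_gbps)
    return non_network_s + transfer_time_s(bytes_moved, target_gbps)


if __name__ == "__main__":
    payload = 40e6           # 40 MB, the transfer size mentioned in the abstract
    measured = 1.0           # hypothetical total time over 1 Gbps Ethernet, in seconds
    projected = estimate_remote_time_s(measured, payload, 1.0, 40.0)
    print(f"projected time over 40 Gbps: {projected:.3f} s")
```

Under these assumptions, the faster the target network, the closer the projected time gets to the non-network share of the run, which matches the abstract's qualitative conclusion that remote GPUs approach local performance over high-performance interconnects. A more faithful model would use measured effective bandwidth and per-message latency instead of nominal link speed.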