OpenCL is an open standard for heterogeneous parallel programming that exploits multi-core CPUs, GPUs, and other accelerators as parallel computing resources. Recent work has extended the OpenCL programming model to distributed heterogeneous clusters. On such loosely coupled acceleration architectures, designing OpenCL programs for maximum performance differs considerably from designing for conventional tightly coupled acceleration platforms. This paper describes our experience programming in OpenCL to extract scalable performance from a distributed heterogeneous cluster environment. We chose two real-world analytics workloads, Two-Step Cluster and Linear Regression, that pose different challenges for efficient OpenCL implementation. We obtained scalable performance on this architecture by carefully managing the amount of data and computation in the kernel design and by applying optimizations that address network latency.