CUDA-level performance with python-level productivity for Gaussian mixture model applications
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
The Expectation Maximization (EM) algorithm is an iterative technique widely used in signal processing and data mining. We present a parallel implementation of EM for finding maximum likelihood estimates of the parameters of Gaussian mixture models, designed for the many-core architecture of Graphics Processing Units (GPUs). The algorithm is implemented on NVIDIA GPUs using CUDA, following the single-instruction, multiple-threads model. In this paper, the emphasis is on exploiting data parallelism with CUDA to accelerate the computations. The CUDA implementation of EM is designed so that its speed scales with the number of GPU cores, and experimental results confirm this scalability across cores. The results also show that the CUDA implementation, applied to an input of 230K samples for a 32-component mixture of 32-dimensional Gaussians, completes one iteration in 264 msec on a Quadro FX 5800 (NVIDIA 200 series) with 240 cores, about 164 times faster than a naive single-threaded C implementation on the CPU.
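The abstract does not reproduce the kernels themselves, but the computation being parallelized is the standard EM iteration for a Gaussian mixture: an E-step that assigns each sample a responsibility for each component, and an M-step that re-estimates weights, means, and covariances from those responsibilities. As a reference sketch of that iteration (not the paper's CUDA code), assuming diagonal covariances for brevity, one iteration in NumPy might look like:

```python
import numpy as np

def em_gmm_iteration(X, weights, means, covs):
    """One EM iteration for a diagonal-covariance Gaussian mixture (illustrative sketch).

    X: (N, D) samples; weights: (K,) mixing weights;
    means: (K, D) component means; covs: (K, D) diagonal variances.
    """
    N, D = X.shape
    K = weights.shape[0]

    # E-step: log responsibilities log r[n, k] = log w_k + log N(x_n | mu_k, Sigma_k) + const
    log_r = np.empty((N, K))
    for k in range(K):
        diff = X - means[k]                               # (N, D)
        mahal = np.sum(diff ** 2 / covs[k], axis=1)       # Mahalanobis term, diagonal case
        log_det = np.sum(np.log(covs[k]))
        log_r[:, k] = (np.log(weights[k])
                       - 0.5 * (mahal + log_det + D * np.log(2.0 * np.pi)))
    log_r -= log_r.max(axis=1, keepdims=True)             # subtract row max for stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                     # normalize responsibilities

    # M-step: re-estimate parameters from the responsibilities
    Nk = r.sum(axis=0)                                    # effective count per component
    new_weights = Nk / N
    new_means = (r.T @ X) / Nk[:, None]
    new_covs = np.empty_like(covs)
    for k in range(K):
        diff = X - new_means[k]
        # small floor keeps variances positive
        new_covs[k] = (r[:, k, None] * diff ** 2).sum(axis=0) / Nk[k] + 1e-6
    return new_weights, new_means, new_covs
```

The per-sample independence of the E-step (each row of `log_r` depends only on one sample) is what makes the algorithm map naturally onto one-thread-per-sample CUDA kernels, with the M-step sums implemented as parallel reductions.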