CUDA-level performance with Python-level productivity for Gaussian mixture model applications

  • Authors:
  • H. Cook; E. Gonina; S. Kamil; G. Friedland; D. Patterson; A. Fox

  • Affiliations:
  • Parallel Computing Laboratory, University of California at Berkeley, Berkeley, California (H. Cook, E. Gonina, S. Kamil, D. Patterson, A. Fox); International Computer Science Institute, Berkeley, California (G. Friedland)

  • Venue:
  • HotPar'11: Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism
  • Year:
  • 2011

Abstract

Typically, scientists with computational needs prefer to use high-level languages such as Python or MATLAB; however, large computationally intensive problems must eventually be recoded in a low-level language such as C or Fortran by expert programmers in order to achieve sufficient performance. In addition, multiple strategies may exist for mapping a problem onto parallel hardware, depending on the input data size and the hardware parameters. We show how to preserve the productivity of high-level languages while obtaining the performance of the best low-level code variant for a given hardware platform and problem size using SEJITS (Selective Embedded Just-In-Time Specialization), a set of techniques that leverages just-in-time code generation and compilation. As a case study, we demonstrate our technique for Gaussian Mixture Model training using the EM algorithm. With the addition of one line of code to import our framework, a domain programmer using an existing Python GMM library can run her program unmodified on a GPU-equipped computer and achieve performance that meets or beats GPU code hand-crafted by a human expert. We also show that despite the overhead of allowing the domain expert's program to use Python and the overhead of just-in-time code generation and compilation, our approach still results in performance competitive with hand-crafted GPU code.
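
For context, the computation being specialized can be sketched in a few lines of plain Python/NumPy. The following is a minimal, illustrative single EM iteration for a diagonal-covariance GMM; it is not the paper's implementation (which generates, compiles, and caches CUDA code variants at runtime), and the function and variable names here are our own.

    import numpy as np

    def em_step(X, weights, means, covars):
        """One EM iteration for a diagonal-covariance GMM (illustrative only)."""
        N, D = X.shape
        K = means.shape[0]

        # E-step: responsibilities resp[n, k] = p(component k | x_n)
        log_prob = np.empty((N, K))
        for k in range(K):
            diff = X - means[k]
            log_prob[:, k] = (np.log(weights[k])
                              - 0.5 * np.sum(diff ** 2 / covars[k]
                                             + np.log(2.0 * np.pi * covars[k]), axis=1))
        log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
        resp = np.exp(log_prob - log_norm)

        # M-step: re-estimate weights, means, and diagonal covariances
        Nk = resp.sum(axis=0)                      # effective count per component
        weights = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        covars = np.empty_like(means)
        for k in range(K):
            diff = X - means[k]
            covars[k] = (resp[:, k, None] * diff ** 2).sum(axis=0) / Nk[k] + 1e-6
        return weights, means, covars

In the approach described in the abstract, a Python call performing the equivalent of this loop is intercepted by the imported framework and dispatched to a generated CUDA code variant selected to suit the GPU and the problem size.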