CUDA-level performance with Python-level productivity for Gaussian mixture model applications

  • Authors:
  • H. Cook; E. Gonina; S. Kamil; G. Friedland; D. Patterson; A. Fox

  • Affiliations:
  • Parallel Computing Laboratory, University of California at Berkeley, Berkeley, California (H. Cook, E. Gonina, S. Kamil, D. Patterson, A. Fox); International Computer Science Institute, Berkeley, California (G. Friedland)

  • Venue:
  • HotPar'11: Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism
  • Year:
  • 2011

Abstract

Typically, scientists with computational needs prefer to use high-level languages such as Python or MATLAB; however, large computationally intensive problems must eventually be recoded in a low-level language such as C or Fortran by expert programmers in order to achieve sufficient performance. In addition, multiple strategies may exist for mapping a problem onto parallel hardware, depending on the input data size and the hardware parameters. We show how to preserve the productivity of high-level languages while obtaining the performance of the best low-level code variant for a given hardware platform and problem size using SEJITS (Selective Embedded Just-In-Time Specialization), a set of techniques that leverages just-in-time code generation and compilation. As a case study, we demonstrate our technique for Gaussian Mixture Model training using the EM algorithm. With the addition of one line of code to import our framework, a domain programmer using an existing Python GMM library can run her program unmodified on a GPU-equipped computer and achieve performance that meets or beats GPU code hand-crafted by a human expert. We also show that despite the overhead of allowing the domain expert's program to use Python and the overhead of just-in-time code generation and compilation, our approach still results in performance competitive with hand-crafted GPU code.
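
For context, the computation being specialized can be sketched in a few lines of plain Python/NumPy. The following is a minimal, illustrative single EM iteration for a diagonal-covariance GMM; it is not the paper's implementation (which generates, compiles, and caches CUDA code variants at runtime), and the function and variable names here are our own.

    import numpy as np

    def em_step(X, weights, means, covars):
        """One EM iteration for a diagonal-covariance GMM (illustrative only)."""
        N, D = X.shape
        K = means.shape[0]

        # E-step: responsibilities resp[n, k] = p(component k | x_n)
        log_prob = np.empty((N, K))
        for k in range(K):
            diff = X - means[k]
            log_prob[:, k] = (np.log(weights[k])
                              - 0.5 * np.sum(diff ** 2 / covars[k]
                                             + np.log(2.0 * np.pi * covars[k]), axis=1))
        log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
        resp = np.exp(log_prob - log_norm)

        # M-step: re-estimate weights, means, and diagonal covariances
        Nk = resp.sum(axis=0)                      # effective count per component
        weights = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        covars = np.empty_like(means)
        for k in range(K):
            diff = X - means[k]
            covars[k] = (resp[:, k, None] * diff ** 2).sum(axis=0) / Nk[k] + 1e-6
        return weights, means, covars

In the approach described in the abstract, a Python call performing the equivalent of this loop is intercepted by the imported framework and dispatched to a generated CUDA code variant selected to suit the GPU and the problem size.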