Distributed tuning of machine learning algorithms using MapReduce Clusters

  • Authors:
  • Yasser Ganjisaffar, University of California, Irvine, Irvine, CA
  • Thomas Debeauvais, University of California, Irvine, Irvine, CA
  • Sara Javanmardi, University of California, Irvine, Irvine, CA
  • Rich Caruana, Microsoft Research, Redmond, WA
  • Cristina Videira Lopes, University of California, Irvine, Irvine, CA

  • Venue:
  • Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
  • Year:
  • 2011

Abstract

Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.
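
The core technique the abstract describes, evaluating each point in a hyperparameter grid as an independent map task and reducing by validation score, can be sketched compactly. The following is a minimal illustration, not the authors' implementation: a local multiprocessing pool stands in for the MapReduce cluster, scikit-learn's GradientBoostingRegressor stands in for the boosted-tree learner being tuned, and the grid values are made up for the example.

```python
# Hypothetical sketch of MapReduce-style parallel grid search.
# Assumptions (not from the paper): a local process pool plays the role of
# the cluster, and a synthetic regression task plays the role of the
# ranking / vandalism datasets.
import itertools
from multiprocessing import Pool

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Regularization parameters to explore; each combination is one "map" task.
GRID = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
    "subsample": [0.5, 0.8, 1.0],
}

def evaluate(params):
    """Map step: train one model and score it on the held-out validation set."""
    model = GradientBoostingRegressor(random_state=0, **params)
    model.fit(X_train, y_train)
    return model.score(X_valid, y_valid), params

if __name__ == "__main__":
    combos = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
    with Pool() as pool:
        scores = pool.map(evaluate, combos)  # map phase: embarrassingly parallel
    best_score, best_params = max(scores, key=lambda s: s[0])  # reduce phase
    print(f"best validation R^2 = {best_score:.3f} with {best_params}")
```

The reduce step is also where the overfitting risk described in the abstract enters: when many combinations are compared, the maximum validation score is optimistically biased, so the selected parameters should be re-evaluated on a separate test set.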