Distributed tuning of machine learning algorithms using MapReduce Clusters

  • Authors:
  • Yasser Ganjisaffar, University of California, Irvine, Irvine, CA
  • Thomas Debeauvais, University of California, Irvine, Irvine, CA
  • Sara Javanmardi, University of California, Irvine, Irvine, CA
  • Rich Caruana, Microsoft Research, Redmond, WA
  • Cristina Videira Lopes, University of California, Irvine, Irvine, CA

  • Venue:
  • Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications
  • Year:
  • 2011

Abstract

Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.
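
The core technique the abstract describes, evaluating each point in a hyperparameter grid as an independent map task and reducing by validation score, can be sketched compactly. The following is a minimal illustration, not the authors' implementation: a local multiprocessing pool stands in for the MapReduce cluster, scikit-learn's GradientBoostingRegressor stands in for the boosted-tree learner being tuned, and the grid values are made up for the example.

```python
# Hypothetical sketch of MapReduce-style parallel grid search.
# Assumptions (not from the paper): a local process pool plays the role of
# the cluster, and a synthetic regression task plays the role of the
# ranking / vandalism datasets.
import itertools
from multiprocessing import Pool

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Regularization parameters to explore; each combination is one "map" task.
GRID = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
    "subsample": [0.5, 0.8, 1.0],
}

def evaluate(params):
    """Map step: train one model and score it on the held-out validation set."""
    model = GradientBoostingRegressor(random_state=0, **params)
    model.fit(X_train, y_train)
    return model.score(X_valid, y_valid), params

if __name__ == "__main__":
    combos = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
    with Pool() as pool:
        scores = pool.map(evaluate, combos)  # map phase: embarrassingly parallel
    best_score, best_params = max(scores, key=lambda s: s[0])  # reduce phase
    print(f"best validation R^2 = {best_score:.3f} with {best_params}")
```

The reduce step is also where the overfitting risk described in the abstract enters: when many combinations are compared, the maximum validation score is optimistically biased, so the selected parameters should be re-evaluated on a separate test set.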