We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates.
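
As an illustration of the first method, here is a minimal Python sketch of the average mixture procedure: split the N samples evenly across m machines, minimize each local empirical risk independently, and average the m resulting estimates in a single communication round. The synthetic least-squares setup, the problem sizes, and the helper name local_least_squares are illustrative assumptions for this sketch, not the paper's experimental configuration (the paper's large-scale experiment uses logistic regression).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: N linear-model samples to be split over m machines.
N, m, d = 10_000, 10, 5
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_star + rng.normal(size=N)

def local_least_squares(X_i, y_i):
    # Each machine minimizes its local empirical risk; here, ordinary least squares.
    return np.linalg.lstsq(X_i, y_i, rcond=None)[0]

# Average mixture: distribute evenly, minimize separately, average the estimates.
X_parts, y_parts = np.array_split(X, m), np.array_split(y, m)
theta_avgm = np.mean(
    [local_least_squares(X_i, y_i) for X_i, y_i in zip(X_parts, y_parts)],
    axis=0,
)
```

The only communication is the final transmission of the m local estimates, which is what makes the method attractive at scale.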
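The bootstrap-subsampling method can be sketched in the same setting. Each machine solves its local problem twice, once on all of its data and once on a random subsample containing a fraction r of it, and the two averaged estimates are combined so that the leading-order bias terms cancel. The de-biasing combination (θ̄₁ − r·θ̄₂)/(1 − r) below follows the paper's subsampled average mixture estimator as we read it; the choice r = 0.1 and the reuse of the least-squares solver from the previous sketch are assumptions for illustration.

```python
def savgm(X_parts, y_parts, r=0.1):
    # Subsampled averaging: each machine re-solves on a subsample of ratio r
    # (an assumed illustrative value), then the de-biased combination of the
    # two cross-machine averages is returned.
    full, sub = [], []
    for X_i, y_i in zip(X_parts, y_parts):
        full.append(local_least_squares(X_i, y_i))
        idx = rng.choice(len(y_i), size=int(r * len(y_i)), replace=False)
        sub.append(local_least_squares(X_i[idx], y_i[idx]))
    theta1, theta2 = np.mean(full, axis=0), np.mean(sub, axis=0)
    return (theta1 - r * theta2) / (1 - r)

theta_savgm = savgm(X_parts, y_parts)
```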
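Finally, the stochastic gradient-based variant replaces each exact local minimization with a stochastic gradient pass, trading the statistical rate (the (N/m)^{-3/2} term) for cheaper computation. The sketch below runs plain SGD with a decaying stepsize and a Polyak-Ruppert running average of the iterates on each machine before averaging across machines; the within-machine iterate averaging, the stepsize schedule, and the iteration count are our assumptions, not the paper's tuning.

```python
def local_sgd(X_i, y_i, steps=2_000, lr0=0.5):
    # Plain SGD on the local least-squares objective with stepsize lr0 / sqrt(t),
    # returning the running (Polyak-Ruppert) average of the iterates.
    n, d = X_i.shape
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for t in range(1, steps + 1):
        j = rng.integers(n)                        # sample one data point
        grad = (X_i[j] @ theta - y_i[j]) * X_i[j]  # gradient of 0.5 * (x·θ − y)²
        theta -= (lr0 / np.sqrt(t)) * grad
        theta_avg += (theta - theta_avg) / t       # incremental running mean
    return theta_avg

# One round of communication, as before: average the m local SGD solutions.
theta_sgd = np.mean(
    [local_sgd(X_i, y_i) for X_i, y_i in zip(X_parts, y_parts)],
    axis=0,
)
```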