We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates.
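
As an illustration of the first method, here is a minimal Python sketch of the average mixture procedure: split the N samples evenly across m machines, minimize each local empirical risk independently, and average the m resulting estimates in a single communication round. The synthetic least-squares setup, the problem sizes, and the helper name local_least_squares are illustrative assumptions for this sketch, not the paper's experimental configuration (the paper's large-scale experiment uses logistic regression).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: N linear-model samples to be split over m machines.
N, m, d = 10_000, 10, 5
theta_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ theta_star + rng.normal(size=N)

def local_least_squares(X_i, y_i):
    # Each machine minimizes its local empirical risk; here, ordinary least squares.
    return np.linalg.lstsq(X_i, y_i, rcond=None)[0]

# Average mixture: distribute evenly, minimize separately, average the estimates.
X_parts, y_parts = np.array_split(X, m), np.array_split(y, m)
theta_avgm = np.mean(
    [local_least_squares(X_i, y_i) for X_i, y_i in zip(X_parts, y_parts)],
    axis=0,
)
```

The only communication is the final transmission of the m local estimates, which is what makes the method attractive at scale.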
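The bootstrap-subsampling method can be sketched in the same setting. Each machine solves its local problem twice, once on all of its data and once on a random subsample containing a fraction r of it, and the two averaged estimates are combined so that the leading-order bias terms cancel. The de-biasing combination (θ̄₁ − r·θ̄₂)/(1 − r) below follows the paper's subsampled average mixture estimator as we read it; the choice r = 0.1 and the reuse of the least-squares solver from the previous sketch are assumptions for illustration.

```python
def savgm(X_parts, y_parts, r=0.1):
    # Subsampled averaging: each machine re-solves on a subsample of ratio r
    # (an assumed illustrative value), then the de-biased combination of the
    # two cross-machine averages is returned.
    full, sub = [], []
    for X_i, y_i in zip(X_parts, y_parts):
        full.append(local_least_squares(X_i, y_i))
        idx = rng.choice(len(y_i), size=int(r * len(y_i)), replace=False)
        sub.append(local_least_squares(X_i[idx], y_i[idx]))
    theta1, theta2 = np.mean(full, axis=0), np.mean(sub, axis=0)
    return (theta1 - r * theta2) / (1 - r)

theta_savgm = savgm(X_parts, y_parts)
```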
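Finally, the stochastic gradient-based variant replaces each exact local minimization with a stochastic gradient pass, trading the statistical rate (the (N/m)^{-3/2} term) for cheaper computation. The sketch below runs plain SGD with a decaying stepsize and a Polyak-Ruppert running average of the iterates on each machine before averaging across machines; the within-machine iterate averaging, the stepsize schedule, and the iteration count are our assumptions, not the paper's tuning.

```python
def local_sgd(X_i, y_i, steps=2_000, lr0=0.5):
    # Plain SGD on the local least-squares objective with stepsize lr0 / sqrt(t),
    # returning the running (Polyak-Ruppert) average of the iterates.
    n, d = X_i.shape
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for t in range(1, steps + 1):
        j = rng.integers(n)                        # sample one data point
        grad = (X_i[j] @ theta - y_i[j]) * X_i[j]  # gradient of 0.5 * (x·θ − y)²
        theta -= (lr0 / np.sqrt(t)) * grad
        theta_avg += (theta - theta_avg) / t       # incremental running mean
    return theta_avg

# One round of communication, as before: average the m local SGD solutions.
theta_sgd = np.mean(
    [local_sgd(X_i, y_i) for X_i, y_i in zip(X_parts, y_parts)],
    axis=0,
)
```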