Sampling Algorithms and Coresets for $\ell_p$ Regression

  • Authors:
  • Anirban Dasgupta; Petros Drineas; Boulos Harb; Ravi Kumar; Michael W. Mahoney

  • Affiliations:
  • anirban@yahoo-inc.com and ravikumar@yahoo-inc.com; drinep@cs.rpi.edu; harb@google.com; -; mmahoney@cs.stanford.edu

  • Venue:
  • SIAM Journal on Computing
  • Year:
  • 2009

Abstract

The $\ell_p$ regression problem takes as input a matrix $A\in\mathbb{R}^{n\times d}$, a vector $b\in\mathbb{R}^n$, and a number $p\in[1,\infty)$, and it returns as output a number ${\cal Z}$ and a vector $x_{\textsc{opt}}\in\mathbb{R}^d$ such that ${\cal Z}=\min_{x\in\mathbb{R}^d}\|Ax-b\|_p=\|Ax_{\textsc{opt}}-b\|_p$. In this paper, we construct coresets and obtain an efficient two-stage sampling-based approximation algorithm for the very overconstrained ($n\gg d$) version of this classical problem, for all $p\in[1, \infty)$. The first stage of our algorithm nonuniformly samples $\hat{r}_1=O(36^p d^{\max\{p/2+1,p\}+1})$ rows of $A$ and the corresponding elements of $b$, and then it solves the $\ell_p$ regression problem on the sample; we prove this is an 8-approximation. The second stage of our algorithm uses the output of the first stage to resample $\hat{r}_1/\epsilon^2$ constraints, and then it solves the $\ell_p$ regression problem on the new sample; we prove this is a $(1+\epsilon)$-approximation. Our algorithm unifies, improves upon, and extends the existing algorithms for special cases of $\ell_p$ regression, namely, $p = 1,2$ [K. L. Clarkson, in Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, 2005, pp. 257-266; P. Drineas, M. W. Mahoney, and S. Muthukrishnan, in Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, ACM, New York, SIAM, Philadelphia, 2006, pp. 1127-1136]. In the course of proving our result, we develop two concepts—well-conditioned bases and subspace-preserving sampling—that are of independent interest.
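The two-stage structure described in the abstract can be pictured with a short sketch. The Python below is only an illustration of the sampling pattern, not the authors' algorithm: the stage-one probabilities `probs1` are assumed to be supplied externally (in the paper they are derived from a well-conditioned basis for the column space of $A$), the stage-two probabilities are simplified here to residual-based weights rather than the paper's exact construction, and `solve_lp_regression` is a generic placeholder for solving the subsampled problem.

```python
# Minimal sketch of a two-stage row-sampling scheme for overconstrained
# l_p regression (n >> d), assuming stage-one probabilities are given.
# This is an illustration, not the paper's algorithm or sampling weights.
import numpy as np
from scipy.optimize import minimize


def solve_lp_regression(A, b, p):
    """Return an approximate argmin_x ||Ax - b||_p (placeholder solver)."""
    x0 = np.linalg.lstsq(A, b, rcond=None)[0]        # least-squares warm start
    obj = lambda x: np.sum(np.abs(A @ x - b) ** p)   # l_p objective, p >= 1
    return minimize(obj, x0, method="Powell").x


def subsample(A, b, probs, r, p):
    """Keep row i with probability q_i = min(1, r * probs[i]) and rescale the
    kept row (and entry of b) by q_i**(-1/p) so the sampled l_p norm is an
    unbiased surrogate for the full one."""
    q = np.minimum(1.0, r * probs)
    keep = np.random.rand(A.shape[0]) < q
    scale = q[keep] ** (-1.0 / p)
    return A[keep] * scale[:, None], b[keep] * scale


def two_stage_lp_regression(A, b, p, probs1, r1, eps):
    # Stage 1: nonuniform sample of ~r1 rows; solving on the sample gives a
    # coarse constant-factor approximation.
    A1, b1 = subsample(A, b, probs1, r1, p)
    x1 = solve_lp_regression(A1, b1, p)

    # Stage 2: resample ~r1 / eps^2 constraints using the stage-1 residuals
    # (a simplification of the paper's weights), then refine the solution.
    resid = np.abs(A @ x1 - b) ** p
    probs2 = resid / resid.sum()
    A2, b2 = subsample(A, b, probs2, r1 / eps**2, p)
    return solve_lp_regression(A2, b2, p)
```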