Optimal outlier removal in high-dimensional

  • Authors:
  • John Dunagan;Santosh Vempala

  • Affiliations:
  • Department of Mathematics, MIT, Cambridge, MA;Department of Mathematics, MIT, Cambridge, MA

  • Venue:
  • STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
  • Year:
  • 2001

Abstract

We study the problem of finding an outlier-free subset of a set of points (or a probability distribution) in n-dimensional Euclidean space. A point x is defined to be a $\beta$-outlier if there exists some direction w in which its squared distance from the mean along w is greater than $\beta$ times the average squared distance from the mean along w [1]. Our main theorem is that for any $\epsilon > 0$, there exists a $(1-\epsilon)$ fraction of the original distribution that has no $O\left(\frac{n}{\epsilon}\left(b + \log\frac{n}{\epsilon}\right)\right)$-outliers, improving on the previous bound of $O(n^7 b/\epsilon)$. This bound is shown to be nearly the best possible. The theorem is constructive, and results in a $\frac{1}{1-\epsilon}$ approximation to the following optimization problem: given a distribution $\mu$ (i.e. the ability to sample from it), and a parameter $\epsilon > 0$, find the minimum $\beta$ for which there exists a subset of probability at least $(1-\epsilon)$ with no $\beta$-outliers.
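The outlier definition in the abstract can be made concrete for a finite point set. The sketch below is an illustration only and is not the algorithm of the paper. It relies on a standard linear-algebra fact that the abstract does not state: for a point x with d = x - mean and a nonsingular covariance matrix S (second moments about the mean), the largest ratio over all directions w of (w.d)^2 to the average of (w.(y - mean))^2 equals the quadratic form d^T S^{-1} d. A point is then an outlier for a given threshold exactly when this score exceeds the threshold. The function names `solve`, `outlier_scores`, and `is_outlier` are hypothetical, and the code assumes the covariance matrix is nonsingular.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def outlier_scores(points):
    """For each point, return the maximum over directions w of
    (w.(x - mean))^2 divided by the average of (w.(y - mean))^2,
    computed as d^T S^{-1} d (assumes S is nonsingular)."""
    m, n = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / m for j in range(n)]
    centered = [[p[j] - mean[j] for j in range(n)] for p in points]
    # Average second moments about the mean (divide by m, not m - 1).
    cov = [[sum(c[i] * c[j] for c in centered) / m for j in range(n)]
           for i in range(n)]
    scores = []
    for d in centered:
        z = solve(cov, d)
        scores.append(sum(di * zi for di, zi in zip(d, z)))
    return scores


def is_outlier(score, beta):
    """A point is an outlier at threshold beta if its score exceeds beta."""
    return score > beta
```

For example, for the one-dimensional points 0, 0, 0, 4 the mean is 1 and the average squared deviation is 3, so the first three points score 1/3 and the last scores 3: it is an outlier at threshold 2 but not at threshold 3, since the definition requires strictly greater.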