Optimal outlier removal in high-dimensional spaces

  • Authors:
  • John Dunagan; Santosh Vempala

  • Affiliations:
  • Department of Mathematics, MIT, Cambridge MA; Department of Mathematics, MIT, Cambridge MA

  • Venue:
  • Journal of Computer and System Sciences - STOC 2001
  • Year:
  • 2004

Abstract

We study the problem of finding an outlier-free subset of a set of points (or a probability distribution) in n-dimensional Euclidean space. As in [BFKV 99], a point x is defined to be a β-outlier if there exists some direction w in which its squared distance from the mean along w is greater than β times the average squared distance from the mean along w. Our main theorem is that for any ε > 0, there exists a (1 - ε) fraction of the original distribution that has no O((n/ε)(b + log(n/ε)))-outliers, improving on the previous bound of O(n^7 b/ε). This is asymptotically the best possible, as shown by a matching lower bound. The theorem is constructive, and results in a 1/(1 - ε) approximation to the following optimization problem: given a distribution µ (i.e. the ability to sample from it), and a parameter ε > 0, find the minimum β for which there exists a subset of probability at least (1 - ε) with no β-outliers.
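
The β-outlier definition above quantifies over every direction w, which looks hard to check directly. The sketch below illustrates the definition only; it is not the paper's outlier-removal algorithm. It assumes a finite point set with an invertible covariance matrix and relies on a standard consequence of the Cauchy-Schwarz inequality: the largest ratio of squared deviation along w to average squared deviation along w, taken over all directions w, equals (x - mean)^T Σ^{-1} (x - mean), where Σ is the covariance matrix. The function names (mean_and_covariance, solve, outlier_score, beta_outliers) are hypothetical and chosen for this example.

```python
def mean_and_covariance(points):
    """Return the mean and the covariance matrix (normalized by 1/m)."""
    m, n = len(points), len(points[0])
    mu = [sum(p[i] for p in points) / m for i in range(n)]
    cov = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / m
            for j in range(n)] for i in range(n)]
    return mu, cov


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Assumption of this sketch: the covariance must be invertible.
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def outlier_score(x, mu, cov):
    """Max over directions w of (w.(x-mu))^2 / average of (w.(p-mu))^2."""
    y = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(cov, y)  # z = cov^{-1} y
    return sum(yi * zi for yi, zi in zip(y, z))


def beta_outliers(points, beta):
    """Return the points that are beta-outliers with respect to the whole set."""
    mu, cov = mean_and_covariance(points)
    return [p for p in points if outlier_score(p, mu, cov) > beta]


if __name__ == "__main__":
    pts = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
    mu, cov = mean_and_covariance(pts)
    print([outlier_score(p, mu, cov) for p in pts])
```

Under these assumptions the scores of the points in a set average to exactly n, so no set can be free of β-outliers for β < n. Removing points also changes the mean and covariance, which is why finding a large outlier-free subset takes more than a single filtering pass.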