Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections

  • Authors:
  • Ping Li

  • Affiliations:
  • Cornell University, Ithaca, NY

  • Venue:
  • Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms
  • Year:
  • 2008

Abstract

The method of stable random projections is popular in data stream computations, data mining, information retrieval, and machine learning, for efficiently computing the lα (0 < α ≤ 2) distances.

We propose algorithms based on (1) the geometric mean estimator, for all 0 < α ≤ 2, and (2) the harmonic mean estimator, only for small α. Our main contributions include:

  • The general sample complexity bound for α ≠ 1, 2. For α = 1, [27] provided a nice argument based on the inverse of the Cauchy density about the median, leading to a sample complexity bound, although they did not provide the constants and their proof restricted ε to be "small enough." For general α ≠ 1, 2, however, the task becomes much more difficult. [27] provided the "conceptual promise" that a sample complexity bound similar to that for α = 1 should exist for general α, if a "non-uniform algorithm based on t-quantile" could be implemented. Such a conceptual algorithm was only for supporting the arguments in [27], not a real implementation. We consider this one of the main problems left open in [27]. In this study, we propose a practical algorithm based on the geometric mean estimator and derive the sample complexity bound for all 0 < α ≤ 2.

  • The practical and optimal algorithm for α = 0+. The l0 norm is an important case. Stable random projections can provide an approximation to the l0 norm using α → 0+. We provide an algorithm based on the harmonic mean estimator, which is simple and statistically optimal. Its tail bounds are sharper than the bounds derived based on the geometric mean. We also discover a (possibly surprising) fact: in boolean data, stable random projections using α = 0+ with the harmonic mean estimator will be about twice as accurate as (l2) normal random projections. Because high-dimensional boolean data are common, we expect this fact will be practically quite useful.

  • The precise theoretical analysis and practical implications. We provide the precise constants in the tail bounds for both the geometric mean and harmonic mean estimators.
We also provide the variances (either exact or asymptotic) for the proposed estimators. These results can assist practitioners to choose sample sizes accurately.
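To make the two estimators concrete, the following is a minimal numerical sketch (not the paper's implementation): α-stable projection entries are drawn with the Chambers-Mallows-Stuck method, and the normalizing constants are calibrated by simulation rather than taken from the closed-form expressions derived in the paper. All sizes, seeds, and the small α value (0.05) below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def stable_sample(alpha, size):
    """Symmetric (beta = 0) alpha-stable variates, unit scale,
    via the Chambers-Mallows-Stuck method."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos(U - alpha * U) / W) ** ((1.0 - alpha) / alpha))

def geometric_mean_estimate(x, alpha, n_cal=200_000):
    """Estimate d = sum_i |v_i|^alpha from k projections x_j = <v, s_j>:
    d_hat = prod_j |x_j|^(alpha/k) / (E|S|^(alpha/k))^k, with the
    normalizing constant approximated by Monte Carlo."""
    k = len(x)
    C = np.mean(np.abs(stable_sample(alpha, n_cal)) ** (alpha / k)) ** k
    return np.prod(np.abs(x) ** (alpha / k)) / C

def harmonic_mean_estimate(x, alpha, n_cal=200_000):
    """Consistent estimate for small alpha:
    d_hat = k * E|S|^(-alpha) / sum_j |x_j|^(-alpha)."""
    k = len(x)
    C = np.mean(np.abs(stable_sample(alpha, n_cal)) ** (-alpha))
    return k * C / np.sum(np.abs(x) ** (-alpha))

n, k = 1000, 1000
# Real-valued data, alpha = 1 (Cauchy projections), geometric mean:
v = rng.standard_normal(n)
x = v @ stable_sample(1.0, (n, k))          # k projections of v
d1_hat = geometric_mean_estimate(x, 1.0)    # estimates sum_i |v_i|
# Boolean data, small alpha approximating the l0 norm, harmonic mean:
b = np.zeros(n)
b[:300] = 1.0                               # 300 nonzeros, so l0 = 300
y = b @ stable_sample(0.05, (n, k))
d0_hat = harmonic_mean_estimate(y, 0.05)    # estimates ~ 300
```

With k = 1000 projections both estimates typically land within a few percent of the true values, consistent with the one-pass, small-space use case described above; the paper's exact constants and tail bounds quantify precisely how large k must be for a target (ε, δ) guarantee.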