Approximating a Gram matrix for improved kernel-based learning

  • Authors:
  • Petros Drineas; Michael W. Mahoney

  • Affiliations:
  • Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY; Department of Mathematics, Yale University, New Haven, CT

  • Venue:
  • COLT'05: Proceedings of the 18th Annual Conference on Learning Theory
  • Year:
  • 2005

Abstract

A problem for many kernel-based methods is that the amount of computation required to find the solution scales as $O(n^3)$, where n is the number of training examples. We develop and analyze an algorithm to compute an easily interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly. The approximation is of the form ${\tilde G}_{k} = CW^{+}_{k}C^{T}$, where C is a matrix consisting of a small number c of columns of G and $W_k$ is the best rank-k approximation to W, the matrix formed by the intersection of those c columns of G with the corresponding c rows of G. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we use a judiciously chosen, data-dependent nonuniform probability distribution. Let $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let $G_k$ be the best rank-k approximation to G. We prove that by choosing $O(k/\epsilon^4)$ columns, $$\left\|G - CW^{+}_{k}C^{T}\right\|_{\xi} \leq \|G - G_{k}\|_{\xi} + \epsilon \sum\limits_{i=1}^{n} G^{2}_{ii},$$ both in expectation and with high probability, for both ξ = 2, F, and for all k: 0 ≤ k ≤ rank(W). This approximation can be computed using O(n) additional space and time, after making two passes over the data from external storage.
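To make the construction concrete, here is a minimal NumPy sketch of the sampling-and-reconstruction step described in the abstract. It assumes G is held in memory as a dense array (the paper's two-pass, O(n)-additional-space streaming variant is not reproduced here), and the function name `approximate_gram` and the unbiased rescaling convention for C and W are illustrative choices inferred from the abstract, not code from the paper.

```python
import numpy as np


def approximate_gram(G, c, k, seed=None):
    """Sketch of the column-sampling approximation G ~= C @ pinv(W_k) @ C.T.

    Columns are drawn i.i.d. with the data-dependent probabilities
    p_i proportional to G_ii^2; C and W are rescaled so the estimator
    is unbiased, and W_k is the best rank-k approximation to W
    (computed by eigendecomposition, since W is symmetric).
    """
    rng = np.random.default_rng(seed)
    n = G.shape[0]

    # Nonuniform, data-dependent probabilities p_i = G_ii^2 / sum_j G_jj^2.
    diag_sq = np.diag(G) ** 2
    p = diag_sq / diag_sq.sum()

    # Draw c column indices i.i.d. according to p (with replacement).
    idx = rng.choice(n, size=c, replace=True, p=p)

    # Rescale each sampled column by 1 / sqrt(c * p_i).
    scale = 1.0 / np.sqrt(c * p[idx])                   # shape (c,)
    C = G[:, idx] * scale                               # n x c
    W = G[np.ix_(idx, idx)] * np.outer(scale, scale)    # c x c

    # Best rank-k approximation to the symmetric W, then its pseudoinverse.
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[::-1][:k]
    keep = top[vals[top] > 1e-12]                       # guard against rank deficiency
    Wk_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T

    return C @ Wk_pinv @ C.T                            # rank <= k approximation of G
```

The dense, in-memory formulation above is chosen for clarity only: since the method touches just the diagonal of G (for the sampling probabilities), the c sampled columns, and the c × c matrix W, the paper's algorithm needs only two passes over G in external storage and O(n) additional space and time, as stated in the abstract.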