Factorization-based lossless compression of inverted indices

  • Authors: George Beskales; Marcus Fontoura; Maxim Gurevich; Sergei Vassilvitskii; Vanja Josifovski

  • Affiliations: University of Waterloo, Waterloo, ON, Canada; Google Inc., Mountain View, CA, USA; Yahoo! Labs, Santa Clara, CA, USA; Yahoo! Labs, New York, NY, USA; Yahoo! Labs, Santa Clara, CA, USA

  • Venue: Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year: 2011

Abstract

Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix whose non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps the terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting the new term space as a matrix factorization problem and prove that finding the optimal solution is NP-hard. We then develop a greedy algorithm for finding an approximate solution. A side effect of our approach is an increase in the number of terms in the index, which may negatively affect query evaluation performance. To eliminate this effect, we develop a methodology for modifying query evaluation algorithms that exploits specific properties of our compression approach.
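
The idea of remapping terms to reduce non-zero entries can be illustrated with a toy sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a binary term-document matrix stored as a dictionary of posting sets, and a simple pairwise greedy rule chosen here for illustration. The names `nnz`, `greedy_merge_step`, `expand`, and the `m0`, `m1`, ... term labels are hypothetical.

```python
from itertools import combinations


def nnz(index, mapping):
    """Count stored entries: all postings plus all term-mapping entries."""
    postings = sum(len(docs) for docs in index.values())
    mapping_entries = sum(len(terms) for terms in mapping.values())
    return postings + mapping_entries


def greedy_merge_step(index, mapping):
    """Apply the single pairwise merge that saves the most entries.

    Assumption: a new term stands for a set of original terms, and its
    posting list holds the documents shared by the two merged terms.
    Returns True if a merge was applied, False if none reduces the size.
    """
    best, best_saving = None, 0
    for t1, t2 in combinations(sorted(index), 2):
        shared = index[t1] & index[t2]
        covered = mapping.get(t1, {t1}) | mapping.get(t2, {t2})
        # Removes 2*|shared| postings, adds |shared| postings
        # and |covered| mapping entries.
        saving = len(shared) - len(covered)
        if saving > best_saving:
            best, best_saving = (t1, t2, shared, covered), saving
    if best is None:
        return False
    t1, t2, shared, covered = best
    new_term = f"m{len(mapping)}"  # hypothetical label for the new term
    index[new_term] = set(shared)
    mapping[new_term] = set(covered)
    index[t1] -= shared
    index[t2] -= shared
    return True


def expand(index, mapping):
    """Rebuild the original postings, showing the rewrite is lossless."""
    original = {}
    for term, docs in index.items():
        for base in mapping.get(term, {term}):
            original.setdefault(base, set()).update(docs)
    return {term: docs for term, docs in original.items() if docs}


index = {
    "new": {1, 2, 3, 4, 5},
    "york": {1, 2, 3, 4, 5, 6},
    "pizza": {5, 7},
}
mapping = {}
before = nnz(index, mapping)
while greedy_merge_step(index, mapping):
    pass
after = nnz(index, mapping)
print(before, after, expand(index, mapping))
```

On this toy index the entry count falls from 13 to 10: the five documents shared by "new" and "york" are stored once under a new term, at the cost of two mapping entries. The example also shows the side effect the abstract mentions: the index now holds four terms instead of three, so a query for "york" has to consult both the "york" and `m0` posting lists.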