Factorization-based lossless compression of inverted indices

Authors:
George Beskales;Marcus Fontoura;Maxim Gurevich;Sergei Vassilvitskii;Vanja Josifovski
Affiliations:
University of Waterloo, Waterloo, ON, Canada;Google Inc., Mountain View, CA, USA;Yahoo! Labs, Santa Clara, CA, USA;Yahoo! Labs, New York, NY, USA;Yahoo! Labs, Santa Clara, CA, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps the terms in a document corpus to a new term space so as to reduce the number of non-zero elements in the term-document matrix, which yields a more compact inverted index. We formulate the selection of a new term space as a matrix factorization problem and prove that finding the optimal solution is NP-hard. We therefore develop a greedy algorithm that finds an approximate solution. A side effect of our approach is an increase in the number of terms in the index, which may degrade query evaluation performance. To eliminate this effect, we develop a methodology for modifying query evaluation algorithms that exploits specific properties of our compression approach.
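
The idea of remapping terms to shrink the term-document matrix can be illustrated with a small sketch. The code below is not the authors' algorithm: it is a toy greedy heuristic written for this summary, and it assumes a boolean index (association strengths are ignored). The function names, the "#0"-style meta-term names, and the cost model (one unit per posting plus one unit per term-mapping entry) are all hypothetical. It repeatedly finds the two posting lists whose shared documents give the largest saving and moves those documents to a single list under a new meta-term.

```python
from itertools import combinations


def count_postings(index):
    """Number of non-zero entries in the (boolean) term-document matrix."""
    return sum(len(docs) for docs in index.values())


def greedy_factorize(index, min_saving=1):
    """Toy greedy pass (an assumption, not the paper's method): replace the
    documents shared by two posting lists with one list under a meta-term."""
    new_index = {term: set(docs) for term, docs in index.items()}
    members = {term: {term} for term in index}    # new term -> original terms
    expansion = {term: {term} for term in index}  # original term -> new terms
    next_id = 0
    while True:
        best_pair, best_saving = None, min_saving - 1
        for a, b in combinations(sorted(new_index), 2):
            shared = new_index[a] & new_index[b]
            if not shared:
                continue
            # k shared docs: 2k postings removed, k postings added, plus one
            # mapping entry per original term the meta-term stands for.
            saving = len(shared) - len(members[a] | members[b])
            if saving > best_saving:
                best_pair, best_saving = (a, b), saving
        if best_pair is None:
            break
        a, b = best_pair
        shared = new_index[a] & new_index[b]
        meta = f"#{next_id}"  # hypothetical meta-term name
        next_id += 1
        new_index[a] -= shared
        new_index[b] -= shared
        new_index[meta] = shared
        members[meta] = members[a] | members[b]
        for original in members[meta]:
            expansion[original].add(meta)
    return new_index, expansion


def reconstruct(new_index, expansion, term):
    """Recover the original posting list of `term` (shows losslessness)."""
    docs = set()
    for new_term in expansion[term]:
        docs |= new_index[new_term]
    return docs


# Made-up example data: doc ids are arbitrary integers.
index = {"new": {1, 2, 3, 4, 5}, "york": {1, 2, 3, 4, 6}, "city": {1, 2, 7}}
new_index, expansion = greedy_factorize(index)
print(count_postings(index), count_postings(new_index))  # 13 9
print(reconstruct(new_index, expansion, "new") == index["new"])  # True
```

In this example the index grows from three terms to four, and a query on "new" must now read two posting lists ("new" and "#0") instead of one. This is the kind of side effect on query evaluation that the abstract describes.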