Static index pruning for information retrieval systems
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Compression of inverted indexes For fast query evaluation
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Optimal aggregation algorithms for middleware
Journal of Computer and System Sciences - Special issu on PODS 2001
The Journal of Machine Learning Research
Efficient query evaluation using a two-level retrieval process
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Biclustering Algorithms for Biological Data Analysis: A Survey
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Non-negative Matrix Factorization with Sparseness Constraints
The Journal of Machine Learning Research
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
Google's MapReduce programming model — Revisited
Science of Computer Programming
Introduction to Information Retrieval
Introduction to Information Retrieval
Inverted index compression and query processing with optimized document ordering
Proceedings of the 18th international conference on World wide web
Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce
Proceedings of the 19th international conference on World wide web
Greed is good: algorithmic results for sparse approximation
IEEE Transactions on Information Theory
Hi-index | 0.00 |
Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space as a matrix factorization problem, and prove that finding the optimal solution is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is increasing the number of terms in the index, which may negatively affect query evaluation performance. To eliminate such effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific properties of our compression approach.