Efficient in-memory top-k document retrieval

Authors:
J. Shane Culpepper;Matthias Petri;Falk Scholer
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Citing 38
Cited 3

Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Document filtering for fast ranking

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Fast evaluation of structured queries for information retrieval

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Query evaluation: strategies and optimizations

Information Processing and Management: an International Journal
Filtered document retrieval with frequency-sorted indexes

Journal of the American Society for Information Science
Self-indexing inverted files for fast text retrieval

ACM Transactions on Information Systems (TOIS)
Optimization of inverted vector searches

SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Inverted files versus signature files for text indexing

ACM Transactions on Database Systems (TODS)
An analysis of the Burrows—Wheeler transform

Journal of the ACM (JACM)
Vector-space ranking with effective early termination

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Succinct indexable dictionaries with applications to encoding k-ary trees and multisets

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient algorithms for document retrieval problems

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Compressing Inverted Files

Information Retrieval
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Opportunistic data structures with applications

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Succinct static data structures

Succinct static data structures
Efficient query evaluation using a two-level retrieval process

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Binary Codes for Non-Uniform Sources

DCC '05 Proceedings of the Data Compression Conference
Optimization strategies for complex queries

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Indexing compressed text

Journal of the ACM (JACM)
Inverted files for text search engines

ACM Computing Surveys (CSUR)
Pruned query evaluation using pre-computed impacts

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Succinct data structures for flexible text retrieval systems

Journal of Discrete Algorithms
A taxonomy of suffix array construction algorithms

ACM Computing Surveys (CSUR)
Space-Efficient Algorithms for Document Retrieval

CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Compressed text indexes: From theory to practice

Journal of Experimental Algorithmics (JEA)
The myriad virtues of Wavelet Trees

Information and Computation
Compressing term positions in web indexes

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Range Quantile Queries: Another Virtue of Wavelet Trees

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Space-Efficient Framework for Top-k String Retrieval Problems

FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
Implicit compression boosting with applications to self-indexing

SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Information Retrieval: Implementing and Evaluating Search Engines

Information Retrieval: Implementing and Evaluating Search Engines
Top-k ranked document search in general text databases

ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays

SIAM Journal on Computing
New algorithms on wavelet trees and applications to information retrieval

Theoretical Computer Science
Enhanced byte codes with restricted prefix properties

SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Space-Efficient top-k document retrieval

SEA'12 Proceedings of the 11th international conference on Experimental Algorithms

Reordering an index to speed query processing without loss of effectiveness

Proceedings of the Seventeenth Australasian Document Computing Symposium
Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences

ACM Computing Surveys (CSUR)
Indexing Word Sequences for Ranked Retrieval

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in memory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for in-memory bag of-words ranked document retrieval using a self-index derived from the FM-Index, wavelet tree, and the compressed suffix tree data structures, and evaluate the various algorithmic trade-offs for performing efficient queries entirely in-memory. We compare our approach with two classic approaches to bag-of-words queries using inverted indexes, term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing. We show that our framework is competitive with state-of-the-art indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.