Top-k ranked document search in general text databases

Authors:
J. Shane Culpepper;Gonzalo Navarro;Simon J. Puglisi;Andrew Turpin
Affiliations:
School of Computer Science and Information Technology, RMIT Univ., Australia;Department of Computer Science, Univ. of Chile;School of Computer Science and Information Technology, RMIT Univ., Australia;School of Computer Science and Information Technology, RMIT Univ., Australia
Venue:
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Year:
2010

Abstract

Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words, and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions about the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with the new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
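
To make the problem statement concrete, the following is a minimal brute-force sketch of top-k ranked retrieval over text with no word boundaries. It is not either of the paper's algorithms. It assumes, purely for illustration, that a document's score is the raw number of (possibly overlapping) occurrences of the query pattern as a substring. The names `count_occurrences` and `top_k_documents` are hypothetical.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, highest frequency first.

    Assumed scoring: raw substring frequency. Ties are broken by the
    smaller document id. Documents with no occurrence are omitted.
    """
    hits = []
    for doc_id, doc in enumerate(docs):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((freq, doc_id))
    # Select the k best without sorting the full candidate list.
    best = heapq.nsmallest(k, hits, key=lambda p: (-p[0], p[1]))
    return [(doc_id, freq) for freq, doc_id in best]


if __name__ == "__main__":
    collection = ["ACGTACGT", "TTTT", "ACGACG", "GGGG"]
    print(top_k_documents(collection, "ACG", 2))
```

This baseline scans every document for every query, so its cost grows with the total collection size. An index built ahead of time, which is the setting the abstract describes, exists to avoid that per-query scan.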