Space-Efficient Framework for Top-k String Retrieval Problems

Authors:
Wing-Kai Hon;Rahul Shah;Jeffrey Scott Vitter
Venue:
FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
Year:
2009

Citing 0
Cited 28

Compression, indexing, and retrieval for massive string data
CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching

Top-k ranked document search in general text databases
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II

Compressed self-indices supporting conjunctive queries on document collections
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

String retrieval for multi-pattern queries
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Colored range queries and document retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Dual-sorted inverted lists
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Practical compressed document retrieval
SEA'11 Proceedings of the 10th international conference on Experimental algorithms

Inverted indexes for phrases and strings
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Improved compressed indexes for full-text document retrieval
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Compressed indexes for aligned pattern matching
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
SIAM Journal on Computing

Word-based self-indexes for natural language text
ACM Transactions on Information Systems (TOIS)

Top-k document retrieval in optimal time and linear space
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms

Top-K color queries for document retrieval
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms

Forbidden patterns
LATIN'12 Proceedings of the 10th Latin American international conference on Theoretical Informatics

Efficient in-memory top-k document retrieval
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Space-Efficient top-k document retrieval
SEA'12 Proceedings of the 11th international conference on Experimental Algorithms

Towards an optimal space-and-query-time index for top-k document retrieval
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching

Document listing for queries with excluded pattern
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching

Compressed data structures with relevance
Proceedings of the 21st ACM international conference on Information and knowledge management

Improved compressed indexes for full-text document retrieval
Journal of Discrete Algorithms

ESP-index: A compressed index based on edit-sensitive parsing
Journal of Discrete Algorithms

Colored range queries and document retrieval
Theoretical Computer Science

Space-efficient data structures for Top-k completion
Proceedings of the 22nd international conference on World Wide Web

Top-k join queries: overcoming the curse of anti-correlation
Proceedings of the 17th International Database Engineering & Applications Symposium

Colored top-K range-aggregate queries
Information Processing Letters

Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences
ACM Computing Surveys (CSUR)

Indexing Word Sequences for Ranked Retrieval
ACM Transactions on Information Systems (TOIS)


Abstract

Given a set ${\cal D}=\{d_1, d_2, \ldots, d_D\}$ of $D$ strings of total length $n$, our task is to report the "most relevant" strings for a given query pattern $P$. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which may be far more than the number of relevant documents required. The first formal framework for studying this kind of retrieval problem was given by [Muthukrishnan, 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking $O(n \log n)$ words of space. We study this problem in a slightly different framework of reporting the top $k$ most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives linear-space data structures with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
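
To make the problem statement concrete, the following is a minimal brute-force sketch of the top-$k$ query under the frequency metric. It is not the data structure from the paper: it scans every document on each query, which is exactly the cost the framework is designed to avoid. The function names `count_occurrences` and `top_k_documents`, the choice to count overlapping occurrences, and the tie-breaking by document index are all assumptions made for illustration.

```python
import heapq
from typing import Callable, List, Tuple


def count_occurrences(doc: str, pattern: str) -> int:
    """Frequency score: number of (possibly overlapping) occurrences of pattern in doc."""
    if not pattern:
        return 0
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted too.
        start = doc.find(pattern, start + 1)
    return count


def top_k_documents(
    docs: List[str],
    pattern: str,
    k: int,
    score: Callable[[str, str], int] = count_occurrences,
) -> List[Tuple[int, int]]:
    """Return up to k (document index, score) pairs in decreasing score order.

    Brute-force baseline: every document is scanned, so the query cost grows
    with the total length n rather than with |P| and k.
    Documents with score 0 are never reported.
    """
    scored = []
    for index, doc in enumerate(docs):
        s = score(doc, pattern)
        if s > 0:
            scored.append((s, index))
    # Highest score first; ties are broken by smaller document index (an assumption).
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(index, s) for s, index in best]


if __name__ == "__main__":
    collection = ["banana", "bandana", "cabana", "apple"]
    print(top_k_documents(collection, "ana", 2))  # [(0, 2), (1, 1)]
```

The `score` parameter is pluggable to mirror the abstract's "arbitrary score functions"; a proximity-style metric could be passed in the same way, although this sketch only implements frequency.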