Given a set ${\cal D}=\{d_1, d_2, \ldots, d_D\}$ of $D$ strings of total length $n$, our task is to report the "most relevant" strings for a given query pattern $P$. This requires more advanced query functionality than ordinary pattern matching, because some notion of "most relevant" is involved. In the information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for a predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search goes through all the occurrences of the pattern over the entire string collection, which may be far more numerous than the relevant documents required. The first formal framework for studying this kind of retrieval problem was given by [Muthukrishnan, 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures taking $O(n \log n)$ words of space. We study this problem in a slightly different framework: reporting the top $k$ most relevant documents (in sorted order) under similar and more general relevance metrics. Our framework gives a linear-space data structure with optimal query times for arbitrary score functions. As a corollary, it improves the space utilization for the problems in [Muthukrishnan, 2002] while maintaining optimal query performance. We also develop compressed variants of these data structures for several specific relevance metrics.
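To make the query concrete, here is a minimal brute-force sketch of top-$k$ retrieval under the frequency metric. It is not the data structure described above. It only fixes what the answer to a query should be, and it scans every document in full, which is the cost an index is meant to avoid. The function names, the tie-breaking rule (lower document index first), and the sample collection are assumptions made for illustration.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(docs, pattern, k):
    """Return up to k (doc_index, frequency) pairs, highest frequency first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by lower document index (an illustrative choice).
    """
    scored = ((i, count_occurrences(d, pattern)) for i, d in enumerate(docs))
    matching = [(i, f) for i, f in scored if f > 0]
    return heapq.nsmallest(k, matching, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample collection.
docs = ["banana", "bandana", "cabana", "nab"]
print(top_k_by_frequency(docs, "ana", 2))  # [(0, 2), (1, 1)]
```

Replacing `count_occurrences` with another scoring function (for example, one based on the proximity of occurrences) changes the relevance metric without changing the shape of the query.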