Efficient index for retrieving top-k most frequent documents

Authors:
Wing-Kai Hon;Manish Patil;Rahul Shah;Shih-Bin Wu
Affiliations:
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan;Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA;Department of Computer Science, Louisiana State University, Baton Rouge, LA, USA;Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Venue:
Journal of Discrete Algorithms
Year:
2010

Citing 14

A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM (JACM)
Managing gigabytes (2nd ed.): compressing and indexing documents and images
A fast string searching algorithm. Communications of the ACM
Succinct representations of lcp information and improvements in the compressed suffix arrays. SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient algorithms for document retrieval problems. SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Introduction to Algorithms
The LCA Problem Revisited. LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
Color Set Size Problem with Application to String Matching. CPM '92 Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching
Augmenting Suffix Trees, with Applications. ESA '98 Proceedings of the 6th Annual European Symposium on Algorithms
Compressed Suffix Trees with Full Functionality. Theory of Computing Systems
Space-Efficient Algorithms for Document Retrieval. CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Linear pattern matching algorithms. SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Rank-Sensitive data structures. SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Position-Restricted substring searching. LATIN'06 Proceedings of the 7th Latin American conference on Theoretical Informatics

Cited 5

Inverted indexes for phrases and strings. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Top-k document retrieval in optimal time and linear space. Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
Towards an optimal space-and-query-time index for top-k document retrieval. CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Document listing for queries with excluded pattern. CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences. ACM Computing Surveys (CSUR)

Abstract

In the document retrieval problem (Muthukrishnan, 2002), we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension to the above document retrieval problem. We call this top-k frequent document retrieval, where instead of listing all documents containing P, our focus is to identify the top-k documents having the most occurrences of P. This problem forms a basis for search engine tasks of retrieving documents ranked with the TFIDF (Term Frequency-Inverse Document Frequency) metric. A related problem was studied by Muthukrishnan (2002), where the emphasis was on retrieving all the documents whose number of occurrences of the pattern P exceeds some frequency threshold f. However, from the information retrieval point of view, it is hard for a user to specify such a threshold value f and have a sense of how many documents will be reported as the output. We develop some additional building blocks which help the user overcome this limitation. These are used to derive an efficient index for the top-k frequent document retrieval problem, answering queries in O(|P| + log D log log D + k) time and taking O(D log D) space. Our approach is based on a new use of the suffix tree called the induced generalized suffix tree (IGST). The practicality of the proposed index is validated by the experimental results.
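
To make the query semantics concrete, the following is a minimal brute-force sketch of what a top-k frequent document query returns. It is not the IGST index described in the abstract and does not achieve the stated bounds: it scans every document on each query. The function names `count_occurrences` and `top_k_frequent`, the choice to count overlapping occurrences, and the tie-breaking by document position are assumptions made for illustration only.

```python
import heapq
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are also counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_frequent(documents: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (document_index, frequency) pairs, highest frequency first.

    Brute-force baseline: every document is scanned on every query.
    Documents that do not contain the pattern are never reported.
    Ties are broken by smaller document index (hypothetical convention).
    """
    scored = []
    for index, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, index))
    best = heapq.nsmallest(k, scored, key=lambda fi: (-fi[0], fi[1]))
    return [(index, freq) for freq, index in best]


if __name__ == "__main__":
    docs = ["abab", "ab", "ababab", "cd"]
    print(top_k_frequent(docs, "ab", 2))
```

With this interface the caller fixes k, the number of documents wanted, rather than guessing a frequency threshold f, which is the usability point the abstract raises against the threshold formulation.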