A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM (JACM)
Managing gigabytes (2nd ed.): compressing and indexing documents and images
A fast string searching algorithm
Communications of the ACM
Succinct representations of lcp information and improvements in the compressed suffix arrays
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient algorithms for document retrieval problems
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Augmenting Suffix Trees, with Applications
ESA '98 Proceedings of the 6th Annual European Symposium on Algorithms
Space-Efficient Algorithms for Document Retrieval
CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Rank-Sensitive data structures
SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Position-Restricted substring searching
LATIN'06 Proceedings of the 7th Latin American conference on Theoretical Informatics
String retrieval for multi-pattern queries
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Colored range queries and document retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Space-Efficient top-k document retrieval
SEA'12 Proceedings of the 11th international conference on Experimental Algorithms
Colored range queries and document retrieval
Theoretical Computer Science
In the document retrieval problem [9], we are given a collection of documents (strings) of total length D in advance, and the goal is to build an index over these documents such that, for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension of this problem, which we call top-k frequent document retrieval: instead of listing all documents containing P, the goal is to identify the k documents with the most occurrences of P. This problem forms a basis for search engine tasks that retrieve documents ranked by the TF-IDF metric. A related problem was studied in [9], where the emphasis was on retrieving all documents in which the number of occurrences of P exceeds some frequency threshold f. From an information retrieval point of view, however, it is hard for a user to specify such a threshold f and to anticipate how many documents will be output. We develop additional building blocks that help the user overcome this limitation. These are used to derive an efficient index for the top-k frequent document retrieval problem that answers queries in O(|P| + log D log log D + k) time and takes O(D log D) space. Our approach is based on a novel use of the suffix tree, called the induced generalized suffix tree (IGST).
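To make the two query types concrete, the following is a minimal brute-force sketch in Python. It is not the IGST index described above and makes no attempt to match its query time; it simply scans every document on each query. The function names, the choice to count overlapping occurrences, and the tie-breaking rule (smaller document id first) are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def documents_above_threshold(documents, pattern, f):
    """Threshold query: ids of documents with more than f occurrences of pattern."""
    return [
        doc_id
        for doc_id, text in enumerate(documents)
        if count_occurrences(text, pattern) > f
    ]


def top_k_frequent_documents(documents, pattern, k):
    """Top-k query: up to k (doc_id, frequency) pairs, most frequent first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by smaller document id (an illustrative choice).
    """
    scored = []
    for doc_id, text in enumerate(documents):
        freq = count_occurrences(text, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda fd: (-fd[0], fd[1]))
    return [(doc_id, freq) for freq, doc_id in best]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabbage", "aaaa"]
    print(top_k_frequent_documents(docs, "a", 2))
    print(documents_above_threshold(docs, "a", 3))
```

With the threshold query the caller must guess f and cannot predict the size of the result, whereas the top-k query bounds the output size directly by k. This is the usability difference the abstract points to.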