Efficient algorithms for document retrieval problems

Authors: S. Muthukrishnan
Affiliations: AT&T Labs - Research, Florham Park, NJ
Venue: SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Year: 2002

Abstract

We are given a collection D of text documents d_1, …, d_k, with ∑_i |d_i| = n, which may be preprocessed. In the document listing problem, we are given an online query comprising a pattern string p of length m, and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing, following which the occurrence listing problem can be solved in time O(m + output), where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated.

We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving the documents matched by the patterns; this contrasts with the pattern matching problems that have been studied more frequently, namely those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach to solving such problems involves performing "local" encodings whereby they are reduced to range query problems on geometric objects (points and lines) that have color. We present improved algorithms for the colored range query problems that arise in our reductions, using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
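To make the document listing problem concrete, here is a minimal Python sketch of one way to reduce it to a colored range query. It is an illustration under stated assumptions, not the paper's construction as published: the suffix array is built naively by sorting (standing in for the linear-time preprocessing the abstract refers to), a sparse table stands in for a constant-time range-minimum structure, and all names (`build_index`, `pattern_range`, `argmin_prev`, `list_documents`) are hypothetical. The idea shown is that the suffixes starting with p form a contiguous range of ranks, each rank is "colored" by its document, and a document is reported exactly once by looking only for ranks whose previous same-colored rank falls before the range.

```python
def build_index(docs):
    """Naive generalized suffix array plus helper arrays (illustrative only)."""
    # All (document id, offset) pairs, sorted by the suffix they start.
    suffixes = sorted(
        ((d, i) for d, text in enumerate(docs) for i in range(len(text))),
        key=lambda s: docs[s[0]][s[1]:],
    )
    doc_of = [d for d, _ in suffixes]  # the "color" of each rank
    # prev[r] = closest earlier rank with the same document, or -1 if none.
    last_seen, prev = {}, []
    for rank, d in enumerate(doc_of):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = rank
    # Sparse table: table[j][i] = index of the minimum of prev[i : i + 2**j].
    table = [list(range(len(prev)))]
    j = 1
    while (1 << j) <= len(prev):
        half, row = 1 << (j - 1), []
        for i in range(len(prev) - (1 << j) + 1):
            a, b = table[j - 1][i], table[j - 1][i + half]
            row.append(a if prev[a] <= prev[b] else b)
        table.append(row)
        j += 1
    return suffixes, doc_of, prev, table


def pattern_range(docs, suffixes, p):
    """Half-open rank range [lo, hi) of suffixes that start with p."""
    def first_rank(strict):
        lo, hi = 0, len(suffixes)
        while lo < hi:
            mid = (lo + hi) // 2
            d, i = suffixes[mid]
            prefix = docs[d][i:i + len(p)]
            if prefix < p or (strict and prefix == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first_rank(False), first_rank(True)


def argmin_prev(prev, table, l, r):
    """Index of the minimum of prev[l..r] (inclusive), via the sparse table."""
    j = (r - l + 1).bit_length() - 1
    a, b = table[j][l], table[j][r - (1 << j) + 1]
    return a if prev[a] <= prev[b] else b


def list_documents(docs, index, p):
    """Ids of documents containing p, each reported once (in no fixed order)."""
    suffixes, doc_of, prev, table = index
    lo, hi = pattern_range(docs, suffixes, p)
    found, stack = [], [(lo, hi - 1)]
    while stack:
        l, r = stack.pop()
        if l > r:
            continue
        m = argmin_prev(prev, table, l, r)
        if prev[m] >= lo:
            continue  # every document in [l, r] also occurs further left in the range
        found.append(doc_of[m])
        stack.extend([(l, m - 1), (m + 1, r)])
    return found


docs = ["banana", "bandana", "cabana", "apple"]
index = build_index(docs)
print(sorted(list_documents(docs, index, "band")))  # [1]
```

In this sketch the listing step does work proportional to the number of distinct documents reported rather than to the number of occurrences of p, which is the distinction between document listing and occurrence listing that the abstract draws. The naive index construction and the binary search over suffixes are simplifications and do not achieve the bounds stated above.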