Practical compressed document retrieval

Authors:
Gonzalo Navarro; Simon J. Puglisi; Daniel Valenzuela
Affiliations:
Dept. of Computer Science, University of Chile; School of Computer Science and Information Technology, Royal Melbourne Institute of Technology; Dept. of Computer Science, University of Chile
Venue:
SEA'11 Proceedings of the 10th international conference on Experimental algorithms
Year:
2011

Abstract

Recent research on document retrieval for general texts has established the virtues of explicitly representing the so-called document array, which stores, for each pointer of the suffix array, the document it belongs to. While it makes document retrieval faster, this array occupies a significant amount of redundant space and is not easily compressible. In this paper we present the first practical proposal to compress the document array. We show that the resulting structure is significantly smaller than its uncompressed counterpart and than the alternatives to the document array proposed in the literature. We also compare various known algorithms for document listing and top-k retrieval, and find that the most useful combinations of algorithms run over our new compressed document arrays.
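
To make the notion of a document array concrete, the following is a minimal Python sketch of the plain, uncompressed structure on a toy collection. It is not the compressed representation proposed in the paper. The function names (build_arrays, suffix_range, list_documents, top_k), the "\x00" terminator, the naive sorting-based suffix array construction, and the choice to rank documents by occurrence count are all assumptions made for illustration.

```python
from collections import Counter


def build_arrays(docs):
    """Return (text, suffix_array, document_array) for a small collection."""
    # Assumption: "\x00" does not occur in any document or pattern, so it
    # can terminate each document and keep matches from spanning two documents.
    text = "".join(d + "\x00" for d in docs)
    # doc_of[i] is the index of the document that contains text position i.
    doc_of = []
    for doc_id, d in enumerate(docs):
        doc_of.extend([doc_id] * (len(d) + 1))
    # Naive suffix array: sort all suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Document array: the document each suffix array pointer belongs to.
    da = [doc_of[p] for p in sa]
    return text, sa, da


def suffix_range(text, sa, pattern):
    """Half-open range [lo, hi) of suffix array entries prefixed by pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def list_documents(da, lo, hi):
    """Document listing: distinct documents in the document array range."""
    return sorted(set(da[lo:hi]))


def top_k(da, lo, hi, k):
    """Top-k by occurrence count (assumed relevance measure), ties by id."""
    counts = Counter(da[lo:hi])
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabana"]
    text, sa, da = build_arrays(docs)
    lo, hi = suffix_range(text, sa, "ana")
    print(list_documents(da, lo, hi))  # [1, 2]
    print(top_k(da, lo, hi, 1))        # [(1, 2)]
```

In this sketch the document array holds one integer per text position, which illustrates the space overhead that motivates compressing it; scanning the whole range da[lo:hi] is the simplest possible listing strategy and is used here only for clarity.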