Compressed self-indices supporting conjunctive queries on document collections

Authors:
Diego Arroyuelo; Senén González; Mauricio Oyarzún
Affiliations:
Yahoo! Research Latin America, Santiago, Chile; Department of Computer Science, Universidad de Chile; Universidad de Santiago de Chile
Venue:
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Year:
2010

Citing 30
Cited 9

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient suffix trees on secondary storage

Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Adaptive set intersections, unions, and differences

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
An analysis of the Burrows-Wheeler transform

Journal of the ACM (JACM)
Adaptive intersection and t-threshold problems

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient algorithms for document retrieval problems

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Modern Information Retrieval

Modern Information Retrieval
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Experiments on Adaptive Set Intersections for Text Retrieval Systems

ALENEX '01 Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation
Succinct static data structures

Succinct static data structures
Representing Trees of Higher Degree

Algorithmica
Inverted files for text search engines

ACM Computing Surveys (CSUR)
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Succinct data structures for flexible text retrieval systems

Journal of Discrete Algorithms
Compressed representations of sequences and full-text indexes

ACM Transactions on Algorithms (TALG)
Succinct indexes for strings, binary relations and multi-labeled trees

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Reorganizing compressed text

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Space-Efficient Algorithms for Document Retrieval

CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Succinct Representations of Arbitrary Graphs

ESA '08 Proceedings of the 16th annual European symposium on Algorithms
Compressed text indexes: From theory to practice

Journal of Experimental Algorithmics (JEA)
Practical Rank/Select Queries over Arbitrary Sequences

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Inverted index compression and query processing with optimized document ordering

Proceedings of the 18th international conference on World wide web
Rank/select on dynamic compressed sequences and applications

Theoretical Computer Science
Compressing and indexing labeled trees, with applications

Journal of the ACM (JACM)
Range Quantile Queries: Another Virtue of Wavelet Trees

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Space-Efficient Framework for Top-k String Retrieval Problems

FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
Compact set representation for information retrieval

SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Fully-functional succinct trees

SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Extended compact web graph representations

Algorithms and Applications

Word-based self-indexes for natural language text

ACM Transactions on Information Systems (TOIS)
Distributed search based on self-indexed compressed text

Information Processing and Management: an International Journal
To index or not to index: time-space trade-offs in search engines with positional ranking functions

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Wavelet trees for all

CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Ranked document retrieval in (almost) no space

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
The wavelet matrix

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Dual-Sorted inverted lists in practice

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Implicit indexing of natural language text by reorganizing bytecodes

Information Retrieval
Wavelet trees for all

Journal of Discrete Algorithms

Abstract

We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in nH_0(T) + o(n)(H_0(T) + 1) bits of space, such that a conjunctive query t_1 ∧ ... ∧ t_k can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H_0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(kδ log(n_M/δ)) time, where n_M is the length of the longest occurrence list among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M/δ is ω(log |Σ|).
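
To make the inverted-index baseline mentioned above concrete, the following is a minimal sketch of an adaptive intersection over sorted occurrence lists, using doubling (galloping) search in a round-robin fashion. It is a simplified illustration in the spirit of adaptive intersection, not the exact algorithm of Barbay and Kenyon and not the self-index proposed in the paper. The function names `gallop` and `intersect` are hypothetical, and each list is assumed to hold strictly increasing document identifiers.

```python
from bisect import bisect_left


def gallop(lst, start, target):
    """Smallest index i >= start with lst[i] >= target, or len(lst) if none.

    Probes at doubling distances from `start`, then finishes with a binary
    search inside the last bracket, so the cost grows with the logarithm of
    the distance skipped rather than with the length of the list.
    """
    lo = hi = start
    step = 1
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1          # everything up to hi is known to be < target
        hi += step
        step *= 2
    return bisect_left(lst, target, lo, min(hi, len(lst)))


def intersect(lists):
    """Intersect k sorted lists (assumed strictly increasing) round-robin."""
    k = len(lists)
    if k == 0 or any(len(l) == 0 for l in lists):
        return []
    pos = [0] * k            # current cursor in each list
    out = []
    candidate = lists[0][0]  # element we are trying to confirm in every list
    matches = 0              # number of lists known to contain the candidate
    i = 0
    while True:
        p = gallop(lists[i], pos[i], candidate)
        pos[i] = p
        if p == len(lists[i]):
            return out       # one list is exhausted: no further results
        if lists[i][p] == candidate:
            matches += 1
            if matches >= k:
                out.append(candidate)
                pos[i] = p + 1
                if pos[i] == len(lists[i]):
                    return out
                candidate = lists[i][pos[i]]
                matches = 1
        else:
            # Overshot: the larger element becomes the new candidate.
            candidate = lists[i][p]
            matches = 1
        i = (i + 1) % k


# Hypothetical occurrence lists for a three-term conjunctive query.
postings = [[1, 3, 5, 7, 9, 11], [3, 4, 5, 9, 12], [0, 3, 9, 10]]
print(intersect(postings))
```

The number of candidate changes in such a procedure depends on how the lists interleave rather than on their total length, which is the intuition behind measuring query cost by the instance difficulty δ.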