Indexes for highly repetitive document collections

Authors:
Francisco Claude;Antonio Fariña;Miguel A. Martínez-Prieto;Gonzalo Navarro
Affiliations:
University of Waterloo, Waterloo, ON, Canada;University of A Coruña, A Coruña, Spain;University of Chile, Santiago, Chile;University of Chile, Santiago, Chile
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection. We also include compressed self-indexes in the comparison. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.
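
As a rough illustration of the run-length idea mentioned above, the sketch below converts a posting list to differential (d-gap) form and then collapses runs of equal gaps. It is not the paper's actual encoding: the function names (`to_gaps`, `rle_gaps`, `decode`), the 1-based document identifiers, and the choice to run-length encode every repeated gap value are assumptions made for the example. The intuition is that when a word survives across many consecutive versions of a document, its differential list contains long runs of 1s, which a run-length scheme stores as a single pair.

```python
from itertools import groupby

def to_gaps(postings):
    """Return the differential (d-gap) form of a strictly increasing posting list."""
    gaps = []
    prev = 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def rle_gaps(gaps):
    """Collapse each run of equal gaps into a (gap, run_length) pair."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]

def decode(pairs):
    """Rebuild the original posting list from (gap, run_length) pairs."""
    postings = []
    current = 0
    for gap, count in pairs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings

# Hypothetical posting list: the word occurs in two blocks of consecutive versions.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = to_gaps(postings)
pairs = rle_gaps(gaps)
assert decode(pairs) == postings
```

In this toy input, ten postings shrink to four pairs. A real index would additionally encode the pairs with a variable-length integer code, which this sketch leaves out.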