Adding Compression to Block Addressing Inverted Indexes

Authors:
Gonzalo Navarro;Edleno Silva De Moura;Marden Neubert;Nivio Ziviani;Ricardo Baeza-Yates
Affiliations:
Department of Computer Science, Univ. of Chile, Chile. gnavarro@dcc.uchile.cl;Department of Computer Science, Univ. Federal de Minas Gerais, Brazil. edleno@dcc.ufmg.br;Department of Computer Science, Univ. Federal de Minas Gerais, Brazil. marden@dcc.ufmg.br;Department of Computer Science, Univ. Federal de Minas Gerais, Brazil. nivio@dcc.ufmg.br;Department of Computer Science, Univ. of Chile, Chile. rbaeza@dcc.uchile.cl
Venue:
Information Retrieval
Year:
2000

Abstract

Inverted index compression, block addressing, and sequential search on compressed text are three techniques that have been developed separately for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly, faster than searching the uncompressed text. Inverted index compression significantly reduces the size of the index at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning.

In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against the three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches.
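The block addressing idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it leaves out both text compression and index compression, and the names `split_blocks`, `build_block_index` and `search`, as well as the word-count block size, are assumptions made for illustration only. The index stores block numbers rather than exact positions, so a query first looks up the candidate blocks and then scans only those blocks sequentially.

```python
import re
from collections import defaultdict


def split_blocks(text, block_size):
    """Cut the text into consecutive blocks of at most block_size words."""
    words = text.split()
    return [" ".join(words[i:i + block_size])
            for i in range(0, len(words), block_size)]


def build_block_index(blocks):
    """Map each word to the sorted list of block numbers that contain it."""
    index = defaultdict(set)
    for block_no, block in enumerate(blocks):
        for word in re.findall(r"\w+", block.lower()):
            index[word].add(block_no)
    return {word: sorted(block_nos) for word, block_nos in index.items()}


def search(word, blocks, index):
    """Return (block number, word offset inside the block) per occurrence.

    Only the blocks listed in the index for this word are scanned.
    """
    target = word.lower()
    hits = []
    for block_no in index.get(target, []):
        tokens = re.findall(r"\w+", blocks[block_no].lower())
        hits.extend((block_no, i) for i, tok in enumerate(tokens) if tok == target)
    return hits


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog the end"
    blocks = split_blocks(text, block_size=4)
    index = build_block_index(blocks)
    print(search("fox", blocks, index))
```

Larger blocks make the lists shorter (less index space) but force more text to be scanned per query, which is the trade-off the abstract studies with respect to block size.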