Index compression using 64-bit words

Authors:
Vo Ngoc Anh;Alistair Moffat
Affiliations:
Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia;Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia
Venue:
Software—Practice & Experience
Year:
2010

Abstract

Modern computers typically use 64-bit words as the fundamental unit of data access. However, the decade-long migration from 32-bit architectures has not been reflected in compression technology, because of a widespread assumption that effective compression techniques operate in terms of bits or bytes rather than words. Here we demonstrate that 64-bit access units, especially in connection with word-bounded codes, do provide an opportunity to improve compression performance. In particular, we extend several 32-bit word-bounded coding schemes to 64-bit operation and explore their uses in information retrieval applications. Our results show that the Simple-8b approach, a 64-bit word-bounded code, is an excellent self-skipping code and has a clear advantage over its competitors in supporting fast query evaluation when the data being compressed represents the inverted index for a large text collection. The advantages of the new code also accrue on 32-bit architectures, and for all of Boolean, ranked, and phrase queries, which means that it can be used in any situation. Copyright © 2010 John Wiley & Sons, Ltd.
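
To make the idea of a 64-bit word-bounded code concrete, the following is a minimal Python sketch in the spirit of Simple-8b. It is not the authors' implementation. It assumes each 64-bit word holds a 4-bit selector plus 60 data bits, and that the selector names how many equal-width integers are packed into the word. The `LAYOUTS` table and the function names `pack_words`, `unpack_words`, and `locate` are hypothetical choices made for illustration; the abstract does not give the selector assignments, and the paper's table may differ. The input is assumed to be a list of non-negative integers below 2^60, such as the document gaps of an inverted list.

```python
# Illustrative (count, width) layouts with count * width <= 60 data bits.
# Assumption: this table is for demonstration only and may not match
# the selector table used by the actual Simple-8b code.
LAYOUTS = [
    (60, 1), (30, 2), (20, 3), (15, 4), (12, 5), (10, 6), (8, 7),
    (7, 8), (6, 10), (5, 12), (4, 15), (3, 20), (2, 30), (1, 60),
]
SELECTOR_BITS = 4
SELECTOR_MASK = (1 << SELECTOR_BITS) - 1


def pack_words(values):
    """Greedily pack non-negative integers (< 2**60) into 64-bit words."""
    words = []
    pos = 0
    while pos < len(values):
        # Try the densest layout first; accept the first one that the
        # next `count` values fill completely.
        for selector, (count, width) in enumerate(LAYOUTS):
            chunk = values[pos:pos + count]
            if len(chunk) == count and all(0 <= v < (1 << width) for v in chunk):
                break
        else:
            raise ValueError(f"value at position {pos} is out of range")
        word = selector
        for slot, v in enumerate(chunk):
            word |= v << (SELECTOR_BITS + slot * width)
        words.append(word)
        pos += count
    return words


def unpack_words(words):
    """Decode every word back into the original integer sequence."""
    values = []
    for word in words:
        count, width = LAYOUTS[word & SELECTOR_MASK]
        payload = word >> SELECTOR_BITS
        mask = (1 << width) - 1
        for slot in range(count):
            values.append((payload >> (slot * width)) & mask)
    return values


def locate(words, k):
    """Find (word_index, slot) of the k-th value by reading selectors only."""
    for index, word in enumerate(words):
        count, _ = LAYOUTS[word & SELECTOR_MASK]
        if k < count:
            return index, k
        k -= count
    raise IndexError("k is beyond the encoded sequence")


if __name__ == "__main__":
    gaps = [3, 1, 1, 7, 2, 100, 5, 1, 4, 900000]
    words = pack_words(gaps)
    assert unpack_words(words) == gaps
    print(len(gaps), "integers packed into", len(words), "64-bit words")
```

Because every emitted word is completely filled, the decoder needs no separate length field. The `locate` helper steps over whole words by inspecting only their selectors, which is one plausible reading of the "self-skipping" property mentioned in the abstract; that reading is an interpretation, not a statement from the source.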