Compression of inverted indexes for fast query evaluation

Authors:
Falk Scholer;Hugh E. Williams;John Yiannis;Justin Zobel
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Abstract

Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency. In experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact Golomb-Rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the CPU cache is less for an appropriately compressed index than for an uncompressed index. Moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. We conclude that fast byte-aligned codes should be used to store integers in inverted lists.
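
As an illustration of what a byte-aligned code for inverted lists can look like, the following is a minimal sketch of a generic variable-byte scheme applied to the gaps between ascending document identifiers. The abstract does not specify the byte layout used in the paper, so the layout here (seven data bits per byte, low-order bits first, high bit set on the final byte of each integer) and the function names `to_gaps`, `from_gaps`, `vbyte_encode` and `vbyte_decode` are assumptions made for this example, not the authors' implementation.

```python
from typing import Iterable, List


def to_gaps(doc_ids: Iterable[int]) -> List[int]:
    """Turn ascending document identifiers into differences (gaps)."""
    gaps = []
    prev = 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps: Iterable[int]) -> List[int]:
    """Inverse of to_gaps: a running sum restores the identifiers."""
    doc_ids = []
    total = 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers: Iterable[int]) -> bytes:
    """Encode non-negative integers, 7 data bits per byte.

    Assumed layout: low-order 7-bit groups first; the high bit (0x80)
    marks the last byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers can be encoded")
        while n >= 128:
            out.append(n & 0x7F)  # more bytes follow: high bit clear
            n >>= 7
        out.append(n | 0x80)      # final byte: high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a byte string produced by vbyte_encode."""
    numbers = []
    n = 0
    shift = 0
    for b in data:
        if b & 0x80:              # terminating byte of this integer
            n |= (b & 0x7F) << shift
            numbers.append(n)
            n = 0
            shift = 0
        else:                     # continuation byte
            n |= b << shift
            shift += 7
    return numbers


doc_ids = [3, 7, 8, 300, 70000]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

Decoding touches whole bytes only and needs no bit-level shifting across byte boundaries, which is the property that makes schemes of this kind cheap to decode compared with bitwise codes such as Golomb-Rice, at the cost of some extra space for small gaps.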