Efficient set intersection for inverted indexing

Authors:
J. Shane Culpepper;Alistair Moffat
Affiliations:
RMIT University and The University of Melbourne, Australia;The University of Melbourne, Australia
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2010

Abstract

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues of access patterns and data representation. Our purpose in this article is to explore these trade-offs by investigating intersection techniques that make use of both uncompressed “integer” representations and compressed arrangements. We also propose a simple hybrid method that provides both compact storage and faster intersection computations for conjunctive querying than is possible even with uncompressed representations.
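To make the |q|-way intersection concrete, the following is a minimal sketch of conjunctive query evaluation over uncompressed, sorted lists of document identifiers. It is not the hybrid method proposed in the article. It is a generic shortest-list-first approach that probes each longer list by binary search, and the names `intersect_pair`, `conjunctive_query`, and the toy `index` are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left


def intersect_pair(short, long):
    """Intersect two sorted lists of distinct document identifiers.

    Each element of the shorter list is looked up in the longer list by
    binary search, resuming from the position of the previous probe.
    """
    result = []
    lo = 0
    for doc_id in short:
        lo = bisect_left(long, doc_id, lo)
        if lo == len(long):
            break  # every remaining candidate exceeds the longer list
        if long[lo] == doc_id:
            result.append(doc_id)
    return result


def conjunctive_query(index, terms):
    """Return the identifiers of documents containing every query term.

    `index` maps a term to its sorted list of document identifiers
    (an assumed, uncompressed in-memory layout).
    """
    postings = []
    for term in terms:
        if term not in index:
            return []  # a missing term makes the conjunction empty
        postings.append(index[term])
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list
    candidates = list(postings[0])
    for plist in postings[1:]:
        candidates = intersect_pair(candidates, plist)
        if not candidates:
            break
    return candidates


# Toy inverted index: term -> sorted document identifiers.
index = {
    "set": [1, 3, 4, 7, 9, 12],
    "intersection": [3, 7, 8, 12, 15],
    "index": [2, 3, 12, 20],
}
print(conjunctive_query(index, ["set", "intersection", "index"]))  # [3, 12]
```

Starting with the shortest list keeps the candidate set small, so each later list is probed only a few times. Those probes are the random accesses that the abstract contrasts with compressed representations, which are usually decoded sequentially.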