Fast set intersection in memory

Authors:
Bolin Ding;Arnd Christian König
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL;Microsoft Research, Redmond, WA
Venue:
Proceedings of the VLDB Endowment
Year:
2011

Abstract

Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear-space data structures that represent sets so that their intersection can be computed efficiently in the worst case. In general, given k (preprocessed) sets with a total of n elements, we show how to compute their intersection in expected time [EQUATION], where r is the intersection size and w is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform state-of-the-art techniques on both synthetic and real data sets and workloads.
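
To make the role of the machine-word width w more concrete, the following is a minimal sketch of one way preprocessing and word-level bit operations can speed up intersection. It is an illustration under stated assumptions, not the data structure from the paper: the names `preprocess`, `intersect`, and `_mix`, the choice of hash function, and the fixed group count are all hypothetical. Each set is split into hash-determined groups, and each group keeps a w-bit signature; two groups can only share an element if the bitwise AND of their signatures is non-zero, so many group pairs are skipped after a single word operation.

```python
W = 64  # assumed machine-word width in bits (the "w" of the abstract)


def _mix(x: int) -> int:
    # Hypothetical 64-bit multiplicative hash; any well-mixing hash would do.
    return (x * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF


def preprocess(elements, num_groups):
    """Split a set of non-negative ints into groups with a W-bit signature each."""
    groups = [[] for _ in range(num_groups)]
    signatures = [0] * num_groups
    for x in sorted(set(elements)):
        h = _mix(x)
        g = h % num_groups       # group index depends only on x
        bit = (h >> 32) % W      # signature bit depends only on x
        groups[g].append(x)
        signatures[g] |= 1 << bit
    return groups, signatures


def intersect(a, b):
    """Intersect two preprocessed sets built with the same num_groups."""
    groups_a, sigs_a = a
    groups_b, sigs_b = b
    if len(groups_a) != len(groups_b):
        raise ValueError("both sets must use the same number of groups")
    result = []
    for ga, sa, gb, sb in zip(groups_a, sigs_a, groups_b, sigs_b):
        # A common element sets the same bit in both signatures, so an
        # empty AND proves the two groups are disjoint (no false negatives).
        if (sa & sb) == 0:
            continue
        members = set(gb)
        result.extend(x for x in ga if x in members)
    return sorted(result)


if __name__ == "__main__":
    a = preprocess(range(0, 1000, 3), num_groups=16)
    b = preprocess(range(0, 1000, 5), num_groups=16)
    print(intersect(a, b)[:5])
```

The signature test can produce false positives (two different elements may map to the same bit), which is why surviving groups are still compared element by element. The sketch makes no claim about matching the expected-time bound stated in the abstract.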