An experimental investigation of set intersection algorithms for text searching

Authors:
Jérémy Barbay;Alejandro López-Ortiz;Tyler Lu;Alejandro Salinger
Affiliations:
Universidad de Chile, Santiago, Chile;University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2010

Citing 6

Adaptive set intersections, unions, and differences
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms

Adaptive intersection and t-threshold problems
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms

Interpolation search for non-independent data
SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms

Alternation and redundancy analysis of the intersection problem
ACM Transactions on Algorithms (TALG)

Faster adaptive set intersections for text searching
WEA'06 Proceedings of the 5th international conference on Experimental Algorithms

Experimental analysis of a fast intersection algorithm for sorted sequences
SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval

Cited 15

Range queries over untangled chains
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Dual-sorted inverted lists
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Fast set intersection in memory
Proceedings of the VLDB Endowment

Efficient answering of set containment queries for skewed item distributions
Proceedings of the 14th International Conference on Extending Database Technology

Workload-aware indexing for keyword search in social networks
Proceedings of the 20th ACM international conference on Information and knowledge management

Word-based self-indexes for natural language text
ACM Transactions on Information Systems (TOIS)

New algorithms on wavelet trees and applications to information retrieval
Theoretical Computer Science

Fast intersection algorithms for sorted sequences
Algorithms and Applications

Wavelet trees for all
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching

Dual-Sorted inverted lists in practice
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Implicit indexing of natural language text by reorganizing bytecodes
Information Retrieval

Faster and smaller inverted indices with treaps
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Spaces, Trees, and Colors: The algorithmic landscape of document retrieval on sequences
ACM Computing Surveys (CSUR)

Compact binary relation representations with rich functionality
Information and Computation

Wavelet trees for all
Journal of Discrete Algorithms


Abstract

The intersection of large ordered sets is a common problem in the evaluation of Boolean queries to a search engine. In this article, we propose several improved algorithms for computing the intersection of sorted arrays, and in particular for searching sorted arrays in the intersection context. We perform an experimental comparison with the algorithms from the previous studies by Demaine, López-Ortiz, and Munro [ALENEX 2001] and by Baeza-Yates and Salinger [SPIRE 2005]; in addition, we implement and test the intersection algorithm of Barbay and Kenyon [SODA 2002] and its randomized variant [SAGA 2003]. We consider the random data set from Baeza-Yates and Salinger, the Google queries used by Demaine et al., a corpus provided by Google, and a larger corpus from the TREC Terabyte 2006 efficiency query stream, along with its own query log. We measure performance both in terms of the number of comparisons and searches performed and in terms of CPU time on two different architectures. Our results confirm or improve the results from both previous studies in their respective contexts (comparison model on real data, and CPU measures on random data) and extend them to new contexts. In particular, we show that value-based search algorithms perform well on posting lists in terms of the number of comparisons performed.
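
As an illustration of what "searching sorted arrays in the intersection context" can look like, the following is a minimal sketch of intersecting two sorted arrays, where each mismatch triggers a search for the other array's current element. It is not taken from the article: the function names `gallop` and `intersect` are hypothetical, the search used here is a doubling (galloping) search followed by a binary search, and the inputs are assumed to be sorted lists of distinct integers such as document identifiers in posting lists. A value-based search would replace the body of `gallop` with a routine that estimates the position from the element values.

```python
from bisect import bisect_left


def gallop(arr, target, lo):
    """Return the smallest index i >= lo with arr[i] >= target (len(arr) if none).

    Hypothetical helper: assumes arr is sorted and every element before lo
    is already known to be smaller than target.
    """
    step = 1
    hi = lo
    # Doubling phase: grow the probe distance until we reach an element
    # that is not smaller than target, or run past the end of the array.
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # Binary search inside the bracket found by the doubling phase.
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def intersect(a, b):
    """Intersect two sorted lists of distinct integers (illustrative sketch)."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in a to the first element not smaller than b[j].
            i = gallop(a, b[j], i)
        else:
            # Skip ahead in b to the first element not smaller than a[i].
            j = gallop(b, a[i], j)
    return result


if __name__ == "__main__":
    print(intersect([1, 3, 4, 7, 9, 12], [2, 3, 7, 8, 12, 20]))
```

The search step is kept in its own function so that different search strategies can be swapped in and compared without touching the intersection loop.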