Experiments on Adaptive Set Intersections for Text Retrieval Systems

Authors:
Erik D. Demaine;Alejandro López-Ortiz;J. Ian Munro
Venue:
ALENEX '01 Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation
Year:
2001

Citing 5
Cited 18

Efficient text searching
A survey of adaptive sorting algorithms

ACM Computing Surveys (CSUR)
“Real world” searching panel at SIGIR 97

ACM SIGIR Forum
Suffix arrays: a new method for on-line string searches

SODA '90 Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms
Adaptive set intersections, unions, and differences

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms

Engineering basic algorithms of an in-memory text search engine

ACM Transactions on Information Systems (TOIS)
Compressed self-indices supporting conjunctive queries on document collections

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Fast set intersection in memory

Proceedings of the VLDB Endowment
Efficient answering of set containment queries for skewed item distributions

Proceedings of the 14th International Conference on Extending Database Technology
Indexing methods for approximate dictionary searching: Comparative analysis

Journal of Experimental Algorithmics (JEA)
Fast lists intersection with Bloom filter using graphics processing units

Proceedings of the 2011 ACM Symposium on Applied Computing
Efficient parallel lists intersection and index compression algorithms using graphics processing units

Proceedings of the VLDB Endowment
Posting list intersection on multicore architectures

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Faster adaptive set intersections for text searching

WEA'06 Proceedings of the 5th international conference on Experimental Algorithms
Adaptive searching in succinctly encoded binary relations and tree-structured documents

CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching
Fast intersection algorithms for sorted sequences

Algorithms and Applications
Experimental analysis of a fast intersection algorithm for sorted sequences

SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Fast candidate generation for two-phase document ranking: postings list intersection with bloom filters

Proceedings of the 21st ACM international conference on Information and knowledge management
Fast candidate generation for real-time tweet search with bloom filter chains

ACM Transactions on Information Systems (TOIS)
Latency-aware strategy for static list caching in flash-based web search engines

Proceedings of the 22nd ACM international conference on Information & knowledge management
Exploiting query term correlation for list caching in web search engines

Proceedings of the 22nd ACM international conference on Information & knowledge management
Document vector representations for feature extraction in multi-stage document ranking

Information Retrieval
Efficient query processing for XML keyword queries based on the IDList index

The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

In [3] we introduced an adaptive algorithm for computing the intersection of k sorted sets within a factor of at most 8k comparisons of the information-theoretic lower bound under a model that deals with an encoding of the shortest proof of the answer. This adaptive algorithm performs better for "burstier" inputs than a straightforward worst-case optimal method. Indeed, we have shown that, subject to a reasonable measure of instance difficulty, the algorithm adapts optimally up to a constant factor. This paper explores how the algorithm behaves under actual data distributions, compared with standard algorithms. We present experiments on searching 114 megabytes of text from the World Wide Web using 5,000 actual user queries from a commercial search engine. The experiments show that the theoretically optimal adaptive algorithm is not always optimal in practice, given the distribution of WWW text data. We then study several improvement techniques for the standard algorithms. These techniques combine improvements suggested by the observed distribution of the data with those suggested by the theoretical results from [3]. We perform controlled experiments on these techniques to determine which ones improve performance, yielding an algorithm that outperforms existing algorithms in most cases.
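
To make the idea of an adaptive intersection concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm from [3], nor any of the variants evaluated in the paper. It only shows the general technique of cycling through k sorted lists and using a doubling ("galloping") search to skip long runs of elements that cannot be in the answer, which is why such methods tend to do less work on "bursty" inputs. The function names `gallop` and `intersect`, and the assumption that each list holds strictly increasing integers (for example document identifiers), are ours.

```python
from bisect import bisect_left


def gallop(lst, target, lo):
    """Return the smallest index i >= lo with lst[i] >= target (len(lst) if none).

    Probes at doubling distances from lo, then binary-searches the last gap,
    so the cost grows with the log of the distance skipped, not the list size.
    """
    hi, step = lo, 1
    while hi < len(lst) and lst[hi] < target:
        lo = hi + 1          # everything up to hi is known to be < target
        hi += step
        step *= 2
    return bisect_left(lst, target, lo, min(hi, len(lst)))


def intersect(sets):
    """Intersect k sorted lists of distinct integers (illustrative sketch)."""
    k = len(sets)
    if k == 0 or any(len(s) == 0 for s in sets):
        return []
    if k == 1:
        return list(sets[0])

    pos = [0] * k            # current search position in each list
    result = []
    candidate = sets[0][0]   # element we are trying to find in every list
    agree = 1                # number of lists known to contain the candidate
    i = 1                    # index of the next list to probe

    while True:
        s = sets[i]
        pos[i] = gallop(s, candidate, pos[i])
        if pos[i] == len(s):
            return result    # one list is exhausted: no further matches
        if s[pos[i]] == candidate:
            agree += 1
            if agree == k:   # all k lists contain the candidate
                result.append(candidate)
                pos[i] += 1
                if pos[i] == len(s):
                    return result
                candidate, agree = s[pos[i]], 1
        else:
            # Overshot: the larger element found becomes the new candidate.
            candidate, agree = s[pos[i]], 1
        i = (i + 1) % k


if __name__ == "__main__":
    print(intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10]]))  # [3, 9]
```

When the lists interleave in only a few long runs, each galloping step jumps over an entire run, so the number of comparisons tracks how hard the instance is rather than the total input size. That is the intuition behind the adaptivity described above, not a reproduction of the bound proved in [3].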