Fast candidate generation for real-time tweet search with bloom filter chains

  • Authors: Nima Asadi; Jimmy Lin
  • Affiliations: University of Maryland at College Park; University of Maryland at College Park
  • Venue: ACM Transactions on Information Systems (TOIS)
  • Year: 2013

Abstract

The rise of social media and other forms of user-generated content has created the demand for real-time search: against a high-velocity stream of incoming documents, users desire a list of relevant results at the time the query is issued. In the context of real-time search on tweets, this work explores candidate generation in a two-stage retrieval architecture where an initial list of results is processed by a second-stage rescorer to produce the final output. We introduce Bloom filter chains, a novel extension of Bloom filters that can dynamically expand to efficiently represent an arbitrarily long and growing list of monotonically increasing integers with a constant false positive rate. Using a collection of Bloom filter chains, a novel approximate candidate generation algorithm called BWand is able to perform both conjunctive and disjunctive retrieval. Experiments show that our algorithm is many times faster than competitive baselines and that this increased performance does not require sacrificing end-to-end effectiveness. Our results empirically characterize the trade-off space defined by output quality, query evaluation speed, and memory footprint for this particular search architecture.
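
The abstract does not describe how a Bloom filter chain is laid out, so the following Python is only an illustrative sketch of one way such a structure could work, not the authors' implementation. It assumes that a chain is a sequence of fixed-capacity Bloom filters, that a new filter is opened whenever the current one fills up, and that a lookup consults only the one filter whose id range covers the queried value. The names BloomFilter, BloomFilterChain, capacity, bits_per_item, and num_hashes are hypothetical, as are the hashing scheme and the default parameters.

```python
import bisect
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over integers (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)
        self.count = 0  # number of items added so far

    def _positions(self, value):
        # Double hashing: derive all bit positions from two base hashes.
        digest = hashlib.sha256(str(value).encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


class BloomFilterChain:
    """Growing list of strictly increasing ids (assumed design, see lead-in)."""

    def __init__(self, capacity=64, bits_per_item=16, num_hashes=4):
        self.capacity = capacity
        self.bits_per_item = bits_per_item
        self.num_hashes = num_hashes
        self.filters = []    # one fixed-size filter per link in the chain
        self.first_ids = []  # smallest id stored in each filter
        self.last_id = None

    def append(self, doc_id):
        if self.last_id is not None and doc_id <= self.last_id:
            raise ValueError("ids must be strictly increasing")
        # Open a new link when the chain is empty or the last filter is full.
        if not self.filters or self.filters[-1].count >= self.capacity:
            size = self.capacity * self.bits_per_item
            self.filters.append(BloomFilter(size, self.num_hashes))
            self.first_ids.append(doc_id)
        self.filters[-1].add(doc_id)
        self.last_id = doc_id

    def __contains__(self, doc_id):
        # Ids outside the stored range cannot be members.
        if self.last_id is None or not (self.first_ids[0] <= doc_id <= self.last_id):
            return False
        # Only the filter whose id range covers doc_id is consulted.
        idx = bisect.bisect_right(self.first_ids, doc_id) - 1
        return doc_id in self.filters[idx]
```

Under these assumptions, every filter holds at most capacity items at a fixed number of bits per item, and each lookup touches a single filter, so the false positive rate of a membership test stays bounded as the list grows. A chain is filled with `chain.append(doc_id)` for increasing ids and queried with `doc_id in chain`; as with any Bloom filter, a positive answer may be a false positive, while a negative answer is always correct.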