Taily: shard selection using the tail of score distributions

Authors:
Robin Aly;Djoerd Hiemstra;Thomas Demeester
Affiliations:
University Twente, Enschede, Netherlands;University Twente, Enschede, Netherlands;Ghent University, Ghent, Belgium
Venue:
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Year:
2013

Citing 18
Cited 0

Query optimization for parallel execution

SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval

ACM Transactions on Information Systems (TOIS)
Searching distributed collections with inference networks

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Cluster-based language models for distributed retrieval

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Relevant document distribution estimation method for resource selection

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A pipelined architecture for distributed text query evaluation

Information Retrieval
Efficiency trade-offs in two-tier web search systems

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
SUSHI: scoring scaled samples for server selection

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Classification-based resource selection

Proceedings of the 18th ACM conference on Information and knowledge management
Score distribution models: assumptions, intuition, and robustness to score manipulation

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Document allocation policies for selective searching of distributed indexes

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
MapReduce for information retrieval evaluation: "let's quickly test this on 12 TB of data"

CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
Modeling score distributions in information retrieval

Information Retrieval
Federated Search

Foundations and Trends in Information Retrieval
Modeling document scores for distributed information retrieval

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Shard ranking and cutoff estimation for topically partitioned collections

Proceedings of the 21st ACM international conference on Information and knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Search engines can improve their efficiency by selecting only few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function's features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.