Approximating the top-m passages in a parallel question answering system

  • Authors:
  • Charles L. A. Clarke;Egidio L. Terra

  • Affiliations:
  • University of Waterloo, Waterloo, Canada;University of Waterloo, Waterloo, Canada

  • Venue:
  • Proceedings of the thirteenth ACM international conference on Information and knowledge management
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

We examine the problem of retrieving the top-m ranked items from a large collection, randomly distributed across an n-node system. In order to retrieve the top m overall, we must retrieve the top m from the subcollection stored on each node and merge the results. However, if we are willing to accept a small probability that one or more of the top-m items may be missed, it is possible to reduce computation time by retrieving only the top k from each node. In this paper, we demonstrate that this simple observation can be exploited in a realistic application to produce a substantial efficiency improvement without compromising the quality of the retrieved results. To support our claim, we present a statistical model that predicts the impact of the optimization. The paper is structured around a specific application~---~passage retrieval for question answering~---~but the primary results are more broadly applicable.