Simple BM25 extension to multiple weighted fields

  • Authors:
  • Stephen Robertson;Hugo Zaragoza;Michael Taylor

  • Affiliations:
  • Microsoft Research, Cambridge, U.K.;Microsoft Research, Cambridge, U.K.;Microsoft Research, Cambridge, U.K.

  • Venue:
  • Proceedings of the thirteenth ACM international conference on Information and knowledge management
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.