Generalized inverse document frequency

  • Authors: Donald Metzler
  • Affiliations: Yahoo! Research, Santa Clara, CA, USA
  • Venue: Proceedings of the 17th ACM Conference on Information and Knowledge Management
  • Year: 2008

Abstract

Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to provide theoretical justifications for IDF. One of the most appealing derivations follows from the Robertson-Sparck Jones relevance weight. However, this derivation, and others related to it, typically make a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more generalized form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we also undertake a rigorous empirical evaluation that shows generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
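
For orientation, the classical derivation mentioned in the abstract can be sketched as follows. This is a minimal illustration of the textbook Robertson-Sparck Jones weight and the commonly cited simplifying assumptions that reduce it to IDF. It is not the generalized IDF proposed in the paper, whose form is not given in this abstract. The function names and the example numbers are hypothetical.

```python
import math


def rsj_weight(p: float, q: float) -> float:
    """Robertson-Sparck Jones relevance weight (log odds ratio).

    p: probability that the term occurs in a relevant document
    q: probability that the term occurs in a non-relevant document
    Both must lie strictly between 0 and 1.
    """
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


def classical_idf(df: int, num_docs: int) -> float:
    """IDF obtained from the RSJ weight under two strong assumptions:

    1. p is the same constant (0.5) for every term, and
    2. q is estimated from the whole collection as df / num_docs.
    Requires 0 < df < num_docs.
    """
    return rsj_weight(0.5, df / num_docs)


def smoothed_idf(df: int, num_docs: int) -> float:
    """Same quantity with the usual 0.5 smoothing, defined for df = 0 as well."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))


if __name__ == "__main__":
    # Hypothetical collection of 1000 documents.
    for df in (10, 100, 400):
        print(df, round(classical_idf(df, 1000), 4), round(smoothed_idf(df, 1000), 4))
```

With p fixed at 0.5, the weight simplifies to log((N - n) / n) for a term occurring in n of N documents, so rarer terms receive larger weights. These fixed choices for p and q are the kind of assumption the paper re-examines.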