TF-IDF uncovered: a study of theories and probabilities

  • Authors:
  • Thomas Roelleke; Jun Wang

  • Affiliations:
  • Queen Mary, University of London, London, United Kingdom; Queen Mary, University of London, London, United Kingdom

  • Venue:
  • Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2008

Abstract

Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper contributes a review of existing interpretations and then systematically relates TF-IDF to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent terms and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy between P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral ∫ 1/x dx = log x. This study uncovers components such as divergence from randomness and pivoted document length to be inherent parts of a document-query independence (DQI) measure, and interestingly, an integral of the DQI over the term occurrence probability leads to TF-IDF.
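
The role of the integral of 1/x can be illustrated numerically. The sketch below is not the paper's derivation, which the abstract does not spell out. It only checks the underlying identity that integrating 1/x from a term occurrence probability p up to 1 gives -log p, which matches the common definition of IDF when p is estimated as df/N. The function names, the collection size, and the frequencies are hypothetical values chosen for illustration.

```python
import math


def idf(df, n_docs):
    """IDF as -log of the term occurrence probability, estimated here as df / N (an assumption)."""
    return -math.log(df / n_docs)


def integral_of_reciprocal(lower, upper=1.0, steps=100_000):
    """Midpoint-rule approximation of the integral of 1/x from lower to upper."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


# Hypothetical collection statistics, not taken from the paper.
n_docs = 1000   # documents in the collection
df = 10         # documents containing the term
tf = 3          # occurrences of the term in one document

p_t = df / n_docs                      # estimated term occurrence probability
approx = integral_of_reciprocal(p_t)   # integral of 1/x from p_t to 1
tf_idf = tf * idf(df, n_docs)          # plain TF-IDF with raw term frequency

print(f"integral of 1/x over [{p_t}, 1]: {approx:.6f}")
print(f"idf = -log(df / N):           {idf(df, n_docs):.6f}")
print(f"tf * idf:                     {tf_idf:.6f}")
```

The two printed IDF values should agree to several decimal places, since the integral of 1/x from p to 1 equals log 1 - log p = -log p. Raw term frequency is used here only to keep the example short; the paper's own TF component may differ.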