TF-IDF uncovered: a study of theories and probabilities

Authors:
Thomas Roelleke;Jun Wang
Affiliations:
Queen Mary, University of London, London, United Kingdom;Queen Mary, University of London, London, United Kingdom
Venue:
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2008


Abstract

Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper reviews existing interpretations and then systematically relates TF-IDF to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent terms and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy of P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral ∫ 1/x dx = log x. This study uncovers components such as divergence from randomness and pivoted document length to be inherent parts of a document-query independence (DQI) measure, and, interestingly, an integral of the DQI over the term occurrence probability leads to TF-IDF.
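
The link between the integral of 1/x and the IDF component can be illustrated numerically. The sketch below is not the paper's derivation, and it does not implement the DQI measure, which the abstract does not define. It assumes the textbook definitions idf(t) = -log(df(t)/N) and tf-idf = tf * idf, and checks only the elementary identity that integrating 1/x from the term occurrence probability p = df/N up to 1 gives -log p. All function names and the toy counts are hypothetical.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Textbook IDF: negative log of the fraction of documents containing the term."""
    return -math.log(doc_freq / num_docs)


def integral_of_reciprocal(lower: float, upper: float = 1.0, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of 1/x over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


def tf_idf(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Textbook TF-IDF: raw term frequency times IDF (an assumed variant, not the paper's)."""
    return term_freq * idf(doc_freq, num_docs)


# Toy collection statistics (hypothetical values for illustration only).
num_docs = 1000
doc_freq = 10
occurrence_probability = doc_freq / num_docs

idf_value = idf(doc_freq, num_docs)
integral_value = integral_of_reciprocal(occurrence_probability)

# The two quantities agree up to the numerical integration error.
print(f"idf      = {idf_value:.6f}")
print(f"integral = {integral_value:.6f}")
print(f"tf-idf for tf=3: {tf_idf(3, doc_freq, num_docs):.6f}")
```

The midpoint rule is used only to keep the example dependency-free; any numerical integrator would serve the same purpose.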