Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to justify IDF theoretically. One of the most appealing derivations follows from the Robertson-Spärck Jones relevance weight. However, this derivation and others related to it typically rest on a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more general form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we undertake a rigorous empirical evaluation showing that generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.
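As a rough illustration of the kind of derivation the abstract refers to, the sketch below computes the Robertson-Spärck Jones relevance weight as a log odds ratio and shows how it reduces to an IDF-like quantity. It assumes two simplifications that are commonly made when no relevance information is available: the probability p that a term occurs in a relevant document is fixed at 0.5, and the probability q that it occurs in a non-relevant document is approximated by the collection-wide fraction n/N. These are not necessarily the exact assumptions the paper examines, and the function names are hypothetical. The sketch does not implement generalized IDF.

```python
import math


def classic_idf(n_docs: int, doc_freq: int) -> float:
    """Classical IDF, log(N / n), for a term occurring in n of N documents."""
    return math.log(n_docs / doc_freq)


def rsj_weight(p: float, q: float) -> float:
    """Robertson-Sparck Jones relevance weight as a log odds ratio.

    p: probability that the term occurs in a relevant document.
    q: probability that the term occurs in a non-relevant document.
    Both must lie strictly between 0 and 1.
    """
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


def rsj_idf(n_docs: int, doc_freq: int) -> float:
    """IDF-like weight obtained from the RSJ weight.

    Assumptions (illustrative, not taken from the paper):
      p = 0.5        -- no knowledge about relevant documents
      q = n / N      -- non-relevant set approximated by the whole collection
    Under these, the weight simplifies to log((N - n) / n).
    Requires 0 < doc_freq < n_docs.
    """
    return rsj_weight(0.5, doc_freq / n_docs)


if __name__ == "__main__":
    N = 1000
    for n in (1, 10, 100, 500):
        print(n, round(classic_idf(N, n), 4), round(rsj_idf(N, n), 4))
```

For rare terms (n much smaller than N) the two weights are close. They diverge as n approaches N/2, where the RSJ-derived weight reaches zero, and the RSJ-derived weight turns negative for terms that occur in more than half of the collection.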