Why inverse document frequency?

Authors:
Kishore Papineni
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Year:
2001

Abstract

Inverse Document Frequency (IDF) is a popular measure of a word's importance. The IDF invariably appears in a host of heuristic measures used in information retrieval. However, so far the IDF has itself been a heuristic. In this paper, we show IDF to be optimal in a principled sense. We show that IDF is the optimal weight of a word with respect to minimization of a Kullback-Leibler distance suitably generalized to nonnegative functions, which need not be probability distributions. This optimization problem is closely related to the maximum entropy problem. We show that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where we treat each document as the query that retrieves itself. That is, IDF is optimal for document self-retrieval.
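
The quantities named in the abstract can be illustrated with a small sketch. It is not the paper's derivation: it assumes the common textbook form idf(w) = log(N / df(w)), the standard generalized Kullback-Leibler divergence for nonnegative vectors, sum(p * log(p / q) - p + q), and a hypothetical overlap score (sum of the IDF weights of shared words) for self-retrieval. The function names and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return idf(w) = log(N / df(w)) for every word in a tokenized corpus.

    Assumption: the plain textbook form of IDF, with no smoothing.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def generalized_kl(p, q):
    """Generalized KL divergence between nonnegative vectors p and q.

    Computes sum(p_i * log(p_i / q_i) - p_i + q_i), with 0 * log 0 taken as 0.
    Assumes q_i > 0 wherever p_i > 0. When p and q both sum to 1 this reduces
    to the ordinary KL divergence.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / q_i)
        total += q_i - p_i
    return total


def self_retrieve(docs, weights, query_index):
    """Use document `query_index` as the query; return the best-scoring index.

    Hypothetical scoring rule: sum of the weights of the words a document
    shares with the query. Ties go to the lowest index.
    """
    query = set(docs[query_index])
    scores = [sum(weights[w] for w in query & set(doc)) for doc in docs]
    return max(range(len(docs)), key=scores.__getitem__)


# Toy corpus (illustrative only).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = idf_weights(docs)
retrieved = [self_retrieve(docs, weights, i) for i in range(len(docs))]
```

In this toy corpus the word "the" occurs in every document and so gets weight log(3/3) = 0, while words unique to one document get the largest weight, log 3. Each document then scores highest against itself, which is the self-retrieval setting the abstract describes.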