Inverse Document Frequency (IDF) is a popular measure of a word's importance. IDF appears in a host of heuristic measures used in information retrieval, yet until now IDF itself has been a heuristic. In this paper, we show IDF to be optimal in a principled sense. Specifically, IDF is the optimal weight of a word with respect to minimizing a Kullback-Leibler distance suitably generalized to nonnegative functions that need not be probability distributions. This optimization problem is closely related to the maximum entropy problem. We show that IDF is the optimal weight associated with a word-feature in an information retrieval setting where each document is treated as the query that retrieves itself. That is, IDF is optimal for document self-retrieval.
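For readers who want the two quantities named above in concrete form, here is a minimal Python sketch. It assumes the textbook definition IDF(w) = log(N / df(w)) and the usual generalized Kullback-Leibler form, sum of p log(p/q) - p + q, for nonnegative functions; the abstract does not state which exact variants the paper uses, so both formulas, along with the helper names `idf_weights` and `generalized_kl` and the toy corpus, are illustrative assumptions.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return log(N / df) per word (assumed textbook IDF, not necessarily the paper's variant).

    N is the number of documents; df is the number of documents containing the word.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, using naive whitespace tokenization.
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def generalized_kl(p, q):
    """Generalized KL divergence for nonnegative sequences p and q (q strictly positive).

    Computes sum(p * log(p / q) - p + q). When p and q each sum to 1, the
    extra terms cancel and this is the ordinary KL divergence.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / q_i)
        total += q_i - p_i
    return total


# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat", "the dog ran", "the cat ran fast"]
weights = idf_weights(docs)
```

In this toy corpus, a word that occurs in every document ("the") gets weight zero, while a word confined to a single document ("fast") gets the largest weight, log 3. This matches the intuition that rare words are the most useful for picking out one document from the rest.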