The trend in information retrieval systems is from document to sub-document retrieval, such as sentences in summarization systems and words or phrases in question-answering systems. Despite this trend, systems continue to model language at the document level using the inverse document frequency (IDF). In this paper, we compare and contrast IDF with inverse sentence frequency (ISF) and inverse term frequency (ITF). A direct comparison reveals that all three language models are highly correlated; however, the average ISF and ITF values are 5.5 and 10.4 higher, respectively, than the average IDF. All language models appear to follow a power-law distribution, with a slope coefficient of 1.6 for documents and 1.7 for sentences and terms. We conclude with an analysis of IDF stability with respect to random, journal, and section partitions of the 100,830 full-text scientific articles in our experimental corpus.
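To make the three statistics concrete, here is a minimal sketch of how IDF, ISF, and ITF could be computed over a toy corpus. The corpus, the helper name `inverse_frequency`, and the plain `log(N / n)` weighting are illustrative assumptions, not taken from the paper; only the idea of counting at document, sentence, and term granularity is.

```python
import math
from collections import Counter

def inverse_frequency(num_units, unit_freq):
    # Generic inverse-frequency weight: log(N / n), where N is the total
    # number of units (documents, sentences, or term tokens) and n is how
    # many of those units contain the term.  (Illustrative form only.)
    return math.log(num_units / unit_freq)

# Toy corpus (an assumption for illustration): each document is a list of
# sentences, and each sentence is a list of term tokens.
docs = [
    [["information", "retrieval", "systems"], ["retrieval", "models"]],
    [["language", "models"], ["information", "models"]],
]

sentences = [s for d in docs for s in d]
tokens = [t for s in sentences for t in s]

# Document frequency: number of documents containing the term.
df = Counter()
for d in docs:
    for t in {t for s in d for t in s}:
        df[t] += 1

# Sentence frequency: number of sentences containing the term.
sf = Counter()
for s in sentences:
    for t in set(s):
        sf[t] += 1

# Term frequency: total occurrences of the term in the corpus.
tf = Counter(tokens)

n_docs, n_sents, n_tokens = len(docs), len(sentences), len(tokens)
term = "models"
idf = inverse_frequency(n_docs, df[term])    # log(2/2) = 0.0
isf = inverse_frequency(n_sents, sf[term])   # log(4/3) ≈ 0.288
itf = inverse_frequency(n_tokens, tf[term])  # log(9/3) ≈ 1.099
print(term, round(idf, 3), round(isf, 3), round(itf, 3))
```

Even on this tiny corpus, ITF > ISF > IDF for the shared term, consistent with the paper's observation that the finer-grained inverse frequencies run higher than IDF.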