The trend in information retrieval systems is from document to sub-document retrieval, such as sentences in summarization systems and words or phrases in question-answering systems. Despite this trend, systems continue to model language at the document level using the inverse document frequency (IDF). In this paper, we compare and contrast IDF with inverse sentence frequency (ISF) and inverse term frequency (ITF). A direct comparison reveals that all three language models are highly correlated; however, the average ISF and ITF values are 5.5 and 10.4 higher, respectively, than the average IDF. All language models appear to follow a power-law distribution, with a slope coefficient of 1.6 for documents and 1.7 for sentences and terms. We conclude with an analysis of IDF stability with respect to random, journal, and section partitions of the 100,830 full-text scientific articles in our experimental corpus.
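To make the three statistics concrete, here is a minimal sketch of how IDF, ISF, and ITF could be computed over a toy corpus. The corpus, the helper name `inverse_frequency`, and the plain `log(N / n)` weighting are illustrative assumptions, not taken from the paper; only the idea of counting at document, sentence, and term granularity is.

```python
import math
from collections import Counter

def inverse_frequency(num_units, unit_freq):
    # Generic inverse-frequency weight: log(N / n), where N is the total
    # number of units (documents, sentences, or term tokens) and n is how
    # many of those units contain the term.  (Illustrative form only.)
    return math.log(num_units / unit_freq)

# Toy corpus (an assumption for illustration): each document is a list of
# sentences, and each sentence is a list of term tokens.
docs = [
    [["information", "retrieval", "systems"], ["retrieval", "models"]],
    [["language", "models"], ["information", "models"]],
]

sentences = [s for d in docs for s in d]
tokens = [t for s in sentences for t in s]

# Document frequency: number of documents containing the term.
df = Counter()
for d in docs:
    for t in {t for s in d for t in s}:
        df[t] += 1

# Sentence frequency: number of sentences containing the term.
sf = Counter()
for s in sentences:
    for t in set(s):
        sf[t] += 1

# Term frequency: total occurrences of the term in the corpus.
tf = Counter(tokens)

n_docs, n_sents, n_tokens = len(docs), len(sentences), len(tokens)
term = "models"
idf = inverse_frequency(n_docs, df[term])    # log(2/2) = 0.0
isf = inverse_frequency(n_sents, sf[term])   # log(4/3) ≈ 0.288
itf = inverse_frequency(n_tokens, tf[term])  # log(9/3) ≈ 1.099
print(term, round(idf, 3), round(isf, 3), round(itf, 3))
```

Even on this tiny corpus, ITF > ISF > IDF for the shared term, consistent with the paper's observation that the finer-grained inverse frequencies run higher than IDF.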