Document frequency and term specificity

Authors:
Hideo Joho; Mark Sanderson
Affiliations:
University of Glasgow, Glasgow; University of Sheffield, Sheffield
Venue:
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
Year:
2007

Abstract

Document frequency is used in a range of applications in Information Retrieval and related fields. A common assumption is that a term's document frequency indicates its level of specificity. However, empirical evidence for this assumption is limited. Therefore, a large-scale experiment was carried out over multiple corpora to gain further insight into the relationship between document frequency and term specificity. The results show that the assumption holds only at the very specific levels, which cover the majority of the vocabulary. The results also show that a larger corpus estimates specificity more accurately. When only a small corpus is available, however, co-occurrence information is shown to be effective for improving the accuracy.
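
To make the central quantity concrete, the following minimal sketch counts document frequency over a toy corpus and turns it into a specificity score. It is an illustration only, not the experimental setup of the paper: the whitespace tokenizer, the three example documents, the function names, and the use of the common inverse document frequency form log(N / df) as the specificity score are all assumptions made here.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # A set ensures each term is counted at most once per document.
        df.update(set(doc.lower().split()))
    return df


def inverse_document_frequency(df, n_docs):
    """Assumed specificity score: log(N / df). Rarer terms score higher."""
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy corpus, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the siamese cat slept",
]

df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))

# Print terms from most frequent (least specific) to least frequent.
for term in sorted(df, key=lambda t: (-df[t], t)):
    print(f"{term:10s} df={df[term]}  idf={idf[term]:.3f}")
```

Under this assumption, a term that appears in every document (such as "the") receives a score of zero, while a term that appears in a single document (such as "siamese") receives the highest score. With so few documents the counts are coarse, which is in line with the abstract's observation that small corpora give less accurate estimates of specificity.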