Correlation of Term Count and Document Frequency for Google N-Grams

Authors:
Martin Klein;Michael L. Nelson
Affiliations:
Department of Computer Science, Old Dominion University, Norfolk, VA 23529;Department of Computer Science, Old Dominion University, Norfolk, VA 23529
Venue:
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Year:
2009

Abstract

For bounded datasets such as the TREC Web Track (WT10g), computing term frequency (TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets provide values for term count (TC), meaning the number of times a certain term occurs in the entire corpus. Intuitively, this value is different from document frequency (DF), the number of documents (e.g., web pages) a certain term occurs in. We investigate the relationship between TC and DF values of terms occurring in the Web as Corpus (WaC) and also the similarity between TC values obtained from the WaC and the Google N-gram dataset. A strong correlation between the two would give us confidence in using the Google N-grams to estimate accurate IDF values, which, for example, are the foundation for generating well-performing lexical signatures based on the TF-IDF scheme. Our results show a very strong correlation between TC and DF within the WaC, with Spearman's ρ ≥ 0.8 (p ≤ 2.2 × 10⁻¹⁶), and a high similarity between TC values from the WaC and the Google N-grams.
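
The distinction between TC and DF, and the rank correlation used to compare them, can be illustrated with a minimal sketch. It is not the authors' code: the toy corpus is invented, the function names (term_count_and_document_frequency, average_ranks, spearman_rho) are hypothetical, and Spearman's ρ is computed here in plain Python as the Pearson correlation of tie-averaged ranks, which is one standard way to define it.

```python
from collections import Counter
from math import sqrt


def term_count_and_document_frequency(documents):
    """Return (TC, DF) counters for a list of tokenized documents."""
    tc, df = Counter(), Counter()
    for tokens in documents:
        tc.update(tokens)        # every occurrence counts toward TC
        df.update(set(tokens))   # each document counts at most once toward DF
    return tc, df


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy corpus, invented for illustration only (not data from the paper).
documents = [
    "the web is a corpus".split(),
    "the web the web the web".split(),
    "a corpus of pages".split(),
]
tc, df = term_count_and_document_frequency(documents)
terms = sorted(tc)
rho = spearman_rho([tc[t] for t in terms], [df[t] for t in terms])
print(f"TC(the)={tc['the']}, DF(the)={df['the']}, rho={rho:.3f}")
```

In this toy corpus "the" occurs four times but in only two documents, which is the gap between TC and DF that the abstract describes. On a real corpus one would more likely use an established statistics library rather than the hand-written ranking above.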