A comparison of techniques for estimating IDF values to generate lexical signatures for the web

Authors:
Martin Klein;Michael L. Nelson
Affiliations:
Old Dominion University, Norfolk, VA, USA;Old Dominion University, Norfolk, VA, USA
Venue:
Proceedings of the 10th ACM workshop on Web information and data management
Year:
2008

Citing 16
Cited 4

Comparing top k lists

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Robust Hyperlinks Cost Just Five Words Each

Robust Hyperlinks Cost Just Five Words Each
Refinement of TF-IDF schemes for web pages using their hyperlinked neighboring pages

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Using the web to obtain frequencies for unseen bigrams

Computational Linguistics - Special issue on web as corpus
Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Analysis of lexical signatures for improving information persistence on the World Wide Web

ACM Transactions on Information Systems (TOIS)
The WT10G dataset and the evolution of the web

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Methods for comparing rankings of search engine results

Computer Networks: The International Journal of Computer and Telecommunications Networking - Web dynamics
Just-in-time recovery of missing web pages

Proceedings of the seventeenth conference on Hypertext and hypermedia
Lazy preservation: reconstructing websites by crawling the crawlers

WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
Agreeing to disagree: search engines and their public interfaces

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Distributed search over the hidden web: hierarchical database sampling and selection

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Web-based inference detection

SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
Revisiting Lexical Signatures to (Re-)Discover Web Pages

ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
WordRank-Based lexical signatures for finding lost or related web pages

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Automated extraction of hit numbers from search result pages

WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management

Correlation of Term Count and Document Frequency for Google N-Grams

ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Is this a good title?

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Evaluating methods to rediscover missing web pages from the web infrastructure

Proceedings of the 10th annual joint conference on Digital libraries
Entity disambiguation for knowledge base population

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics

Quantified Score

Hi-index	0.01

Visualization

Abstract

For bounded datasets such as the TREC Web Track the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, since IDF cannot be directly calculated for the entire web, it must be estimated. We see a need to estimate accurate IDF values to generate TF-IDF based lexical signatures (LSs) of web pages. Future applications for generating such LSs require a real time IDF computation. Therefore we conducted a comparison study of different methods to estimate IDF values of web pages. Our objective is to investigate how accurate these estimation methods are compared to the a baseline. We use the Google N-grams as our baseline and compare it against two IDF estimation techniques which are based on: 1) a "local universe" consisting of textual content and the according document frequencies from copies of URLs from the Internet Archive and 2) "screen scraping", a technique to query the Google web interface for document frequencies. We found a term overlap of 70 to 80% between the results of the two methods and the baseline. We further discovered a great agreement in rank correlation of TF-IDF ranked terms between our methods. Kendall τ is approximately 0.8 and the M-Score (penalizing discordances in higher ranks) is even higher, it peaks at well above 0.9. These preliminary results lead us to the conclusion that both methods are appropriate for creating accurate IDF values for web pages.