For bounded datasets such as the TREC Web Track, computing term frequency (TF) and inverse document frequency (IDF) is not difficult. However, since IDF cannot be calculated directly for the entire web, it must be estimated. Accurate IDF estimates are needed to generate TF-IDF based lexical signatures (LSs) of web pages, and future applications that generate such LSs will require IDF computation in real time. We therefore conducted a comparison study of different methods for estimating IDF values of web pages. Our objective is to investigate how accurate these estimation methods are compared to a baseline. We use the Google N-grams as our baseline and compare it against two IDF estimation techniques, which are based on: 1) a "local universe" consisting of textual content and the corresponding document frequencies from copies of URLs from the Internet Archive and 2) "screen scraping", a technique that queries the Google web interface for document frequencies. We found a term overlap of 70 to 80% between the results of the two methods and the baseline. We also found strong agreement in the rank correlation of TF-IDF ranked terms between our methods: Kendall τ is approximately 0.8, and the M-Score (which penalizes discordances in higher ranks) is even higher, peaking at well above 0.9. These preliminary results lead us to the conclusion that both methods are appropriate for creating accurate IDF values for web pages.
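The comparison described above can be illustrated with a minimal Python sketch. It ranks the terms of a page by TF-IDF under two different document-frequency sources, then measures top-k term overlap and Kendall τ between the two rankings. Everything here is an assumption made for illustration: the function names, the toy page text, the document-frequency tables and the corpus size are invented and are not data from the study. The sketch also assumes a natural-log IDF, a document frequency of 1 for unseen terms, and the simple tie-free form of Kendall τ. The M-Score is not reproduced.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_ranking(text, doc_freq, corpus_size):
    """Return the terms of `text` ordered by TF-IDF, highest score first."""
    tf = Counter(text.lower().split())
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 1)  # assumption: unseen terms get DF = 1
        scores[term] = count * math.log(corpus_size / df)
    # Break score ties alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))


def term_overlap(ls_a, ls_b):
    """Fraction of terms in signature `ls_a` that also occur in `ls_b`."""
    return len(set(ls_a) & set(ls_b)) / len(ls_a)


def kendall_tau(rank_a, rank_b):
    """Tie-free Kendall tau over the terms common to both rankings."""
    in_b = set(rank_b)
    common = [t for t in rank_a if t in in_b]
    if len(common) < 2:
        return 1.0
    pos_b = {t: i for i, t in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(common, 2):  # x precedes y in rank_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Toy inputs (hypothetical, not taken from the study).
page = "archive archive archive crawler crawler the the the the web"
baseline_df = {"the": 1000, "web": 500, "archive": 10, "crawler": 20}
estimated_df = {"the": 900, "web": 100, "archive": 50, "crawler": 5}
corpus_size = 1000

baseline_rank = tfidf_ranking(page, baseline_df, corpus_size)
estimated_rank = tfidf_ranking(page, estimated_df, corpus_size)

k = 3  # length of the lexical signature
overlap = term_overlap(baseline_rank[:k], estimated_rank[:k])
tau = kendall_tau(baseline_rank, estimated_rank)
print(baseline_rank, estimated_rank, overlap, round(tau, 3))
```

In this toy case the two sources swap the top two terms, so the three-term signatures overlap completely while τ drops below 1. That is why the overlap and the rank correlation are reported as separate measures.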