Maintaining distributed hypertext infostructures: welcome to MOMspider's Web
Selected papers of the first conference on World-Wide Web
WebLinker, a tool for managing WWW cross-references
Computer Networks and ISDN Systems
Author-oriented link management
Proceedings of the fifth international World Wide Web conference on Computer networks and ISDN systems
Fixing the “broken-link” problem: the W3Objects approach
Proceedings of the fifth international World Wide Web conference on Computer networks and ISDN systems
CiteSeer: an automatic citation indexing system
Proceedings of the third ACM conference on Digital libraries
BYTE
A technique for measuring the relative size and overlap of public Web search engines
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Summary of WWW characterizations
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
IR evaluation methods for retrieving highly relevant documents
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Evaluation by highly relevant documents
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Analysis of lexical signatures for finding lost or related documents
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Context and Page Analysis for Improved Web Search
IEEE Internet Computing
The Evolution of the Web and Implications for an Incremental Crawler
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Towards an Archival Intermemory
ADL '98 Proceedings of the Advances in Digital Libraries Conference
Robust Hyperlinks Cost Just Five Words Each
Robust Hyperlinks Cost Just Five Words Each
Rate of change and other metrics: a live study of the world wide web
USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems on USENIX Symposium on Internet Technologies and Systems
Just-in-time recovery of missing web pages
Proceedings of the seventeenth conference on Hypertext and hypermedia
Revisiting Lexical Signatures to (Re-)Discover Web Pages
ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
A comparison of techniques for estimating IDF values to generate lexical signatures for the web
Proceedings of the 10th ACM workshop on Web information and data management
Why are moved web pages difficult to find?: the WISH approach
Proceedings of the 18th international conference on World wide web
Correlation of Term Count and Document Frequency for Google N-Grams
ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Inter-search engine lexical signature performance
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Bringing your dead links back to life: a comprehensive approach and lessons learned
Proceedings of the 20th ACM conference on Hypertext and hypermedia
Proceedings of the 21st ACM conference on Hypertext and hypermedia
Evaluating methods to rediscover missing web pages from the web infrastructure
Proceedings of the 10th annual joint conference on Digital libraries
Rediscovering missing web pages using link neighborhood lexical signatures
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
DSNotify - A solution for event detection and link maintenance in dynamic datasets
Web Semantics: Science, Services and Agents on the World Wide Web
WordRank-Based lexical signatures for finding lost or related web pages
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
A system architecture as a support to a flexible annotation service
DELOS'04 Proceedings of the 6th Thematic conference on Peer-to-Peer, Grid, and Service-Orientation in Digital Library Architectures
Identification of top relevant temporal expressions in documents
Proceedings of the 2nd Temporal Web Analytics Workshop
TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Hi-index | 0.01 |
A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.