Analysis of lexical signatures for improving information persistence on the World Wide Web

Authors:
Seung-Taek Park;David M. Pennock;C. Lee Giles;Robert Krovetz
Affiliations:
Yahoo! Research Labs, Pasadena, CA;Yahoo! Research Labs, Pasadena, CA;The Pennsylvania State University, University Park, PA;Ask Jeeves, Piscataway, NJ
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2004

Citing 21
Cited 15

Maintaining distributed hypertext infostructures: welcome to MOMspider's Web

Selected papers of the first conference on World-Wide Web
WebLinker, a tool for managing WWW cross-references

Computer Networks and ISDN Systems
Author-oriented link management

Proceedings of the fifth international World Wide Web conference on Computer networks and ISDN systems
Fixing the “broken-link” problem: the W3Objects approach

Proceedings of the fifth international World Wide Web conference on Computer networks and ISDN systems
CiteSeer: an automatic citation indexing system

Proceedings of the third ACM conference on Digital libraries
Weaving a better Web

BYTE
A technique for measuring the relative size and overlap of public Web search engines

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Summary of WWW characterizations

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
IR evaluation methods for retrieving highly relevant documents

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
How dynamic is the Web?

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
On near-uniform URL sampling

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Evaluation by highly relevant documents

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Analysis of lexical signatures for finding lost or related documents

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Context and Page Analysis for Improved Web Search

IEEE Internet Computing
Digital Libraries and Autonomous Citation Indexing

Computer
Persistence of Web References in Scientific Research

Computer
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Towards an Archival Intermemory

ADL '98 Proceedings of the Advances in Digital Libraries Conference
Robust Hyperlinks Cost Just Five Words Each
Rate of change and other metrics: a live study of the world wide web

USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems

Just-in-time recovery of missing web pages

Proceedings of the seventeenth conference on Hypertext and hypermedia
Revisiting Lexical Signatures to (Re-)Discover Web Pages

ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
A comparison of techniques for estimating IDF values to generate lexical signatures for the web

Proceedings of the 10th ACM workshop on Web information and data management
Why are moved web pages difficult to find?: the WISH approach

Proceedings of the 18th international conference on World wide web
Correlation of Term Count and Document Frequency for Google N-Grams

ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Inter-search engine lexical signature performance

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Bringing your dead links back to life: a comprehensive approach and lessons learned

Proceedings of the 20th ACM conference on Hypertext and hypermedia
Is this a good title?

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Evaluating methods to rediscover missing web pages from the web infrastructure

Proceedings of the 10th annual joint conference on Digital libraries
Rediscovering missing web pages using link neighborhood lexical signatures

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
DSNotify - A solution for event detection and link maintenance in dynamic datasets

Web Semantics: Science, Services and Agents on the World Wide Web
WordRank-Based lexical signatures for finding lost or related web pages

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
A system architecture as a support to a flexible annotation service

DELOS'04 Proceedings of the 6th Thematic conference on Peer-to-Peer, Grid, and Service-Orientation in Digital Library Architectures
Identification of top relevant temporal expressions in documents

Proceedings of the 2nd Temporal Web Analytics Workshop
Identifying "soft 404" error pages: analyzing the lexical signatures of documents in distributed collections

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries

Abstract

A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures: Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period and on a TREC data set, evaluating their ability both to (1) uniquely identify the original (possibly modified) document and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but behaves almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency (TFIDF) method and the hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods, both in extracting the desired document and in finding relevant information, across three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
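
To make the difference between the static selection criteria concrete, the following is a minimal sketch of how TF-, DF-, and TFIDF-based signatures might be chosen for one document. It is an illustration only, not the authors' implementation: the function names (`tokenize`, `lexical_signature`), the whitespace tokenizer, the use of a small in-memory corpus to estimate document frequency, and the alphabetical tie-breaking are all assumptions made for the example. PW, the hybrid methods, and TS are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Return up to k terms of `doc`, ranked by the chosen criterion.

    method: "tf"    -> most frequent terms in the document
            "df"    -> terms that occur in the fewest corpus documents
            "tfidf" -> term frequency weighted by inverse document frequency
    """
    tf = Counter(tokenize(doc))
    n_docs = len(corpus)

    # Document frequency: number of corpus documents containing each term.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))

    def score(term):
        d = max(df[term], 1)  # guard against terms missing from the corpus
        if method == "tf":
            return tf[term]
        if method == "df":
            return -d  # rarer terms get higher scores
        if method == "tfidf":
            return tf[term] * math.log(n_docs / d)
        raise ValueError(f"unknown method: {method}")

    # Highest score first; ties broken alphabetically so results are deterministic.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the zebra crossed the savanna and the zebra ran",
    ]
    for method in ("tf", "df", "tfidf"):
        print(method, lexical_signature(corpus[2], corpus, method=method, k=3))
```

On this toy corpus the TF signature is dominated by the common word "the", the DF signature contains only terms unique to the document, and the TFIDF signature favors "zebra", which is both frequent in the document and rare in the corpus. On the Web, document frequencies would have to be estimated from a search engine or a large sample rather than from a local list.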