Analysis of lexical signatures for finding lost or related documents

Authors:
Seung-Taek Park;David M. Pennock;C. Lee Giles;Robert Krovetz
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;NEC Research Institute;NEC Research Institute
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Citing 9
Cited 9

Maintaining distributed hypertext infostructures: welcome to MOMspider's Web

Selected papers of the first conference on World-Wide Web
WebLinker, a tool for managing WWW cross-references

Computer Networks and ISDN Systems
Fixing the “broken-link” problem: the W3Objects approach

Proceedings of the fifth international World Wide Web conference on Computer networks and ISDN systems
Summary of WWW characterizations

WWW7 Proceedings of the seventh international conference on World Wide Web 7
IR evaluation methods for retrieving highly relevant documents

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation by highly relevant documents

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Digital Libraries and Autonomous Citation Indexing

Computer
Persistence of Web References in Scientific Research

Computer
Towards an Archival Intermemory

ADL '98 Proceedings of the Advances in Digital Libraries Conference

Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Analysis of lexical signatures for improving information persistence on the World Wide Web

ACM Transactions on Information Systems (TOIS)
Index-Based Persistent Document Identifiers

Information Retrieval
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Journal of the American Society for Information Science and Technology - Research Articles
Essential deduplication functions for transactional databases in law firms

Proceedings of the 11th international conference on Artificial intelligence and law
A study about browsers in the Web and the Desktop

EATIS '07 Proceedings of the 2007 Euro American conference on Telematics and information systems
Retrieving similar documents from the web

Journal of Web Engineering
WordRank-Based lexical signatures for finding lost or related web pages

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Identifying "soft 404" error pages: analyzing the lexical signatures of documents in distributed collections

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries

Abstract

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
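
To make the families of methods named in the abstract concrete, the following is a minimal Python sketch of how TF, DF, TFIDF, and hybrid signatures could be selected from a page's terms. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tokenizer, the signature length `k`, the `rare` parameter for the hybrid split, the logarithmic IDF weighting, and the alphabetical tie-breaking are all hypothetical choices. The abstract does not define the PW method or the exact hybrid combinations, so they are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(corpus):
    """Map each term to the number of documents in `corpus` containing it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def lexical_signature(doc, corpus, method="tfidf", k=5, rare=2):
    """Pick up to k terms of `doc` as a signature (hypothetical helper).

    method: "tf"     -> highest term frequency in the document
            "df"     -> lowest document frequency in the corpus
            "tfidf"  -> highest tf * log(N / df)
            "hybrid" -> `rare` lowest-DF terms, then filled up by TF
    Ties are broken alphabetically so the result is deterministic.
    """
    tf = Counter(tokenize(doc))
    df = document_frequencies(corpus)
    n = len(corpus)

    def tfidf(term):
        # max(..., 1) guards against terms absent from the corpus.
        return tf[term] * math.log(n / max(df[term], 1))

    by_tf = sorted(tf, key=lambda t: (-tf[t], t))
    by_df = sorted(tf, key=lambda t: (df[t], t))
    by_tfidf = sorted(tf, key=lambda t: (-tfidf(t), t))

    if method == "tf":
        return by_tf[:k]
    if method == "df":
        return by_df[:k]
    if method == "tfidf":
        return by_tfidf[:k]
    if method == "hybrid":
        head = by_df[:rare]
        tail = [t for t in by_tf if t not in head]
        return (head + tail)[:k]
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [
        "web web web page page zebra search",
        "web page search engine",
        "web search query",
    ]
    for m in ("tf", "df", "tfidf", "hybrid"):
        print(m, lexical_signature(docs[0], docs, method=m, k=3, rare=1))
```

On this toy corpus the TF signature is led by the frequent but common term "web", while the DF signature is led by the rare term "zebra". This mirrors the trade-off described in the abstract: rare terms help identify one specific page, whereas frequent terms say more about its topic.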