Just-in-time recovery of missing web pages

Authors: Terry L. Harrison; Michael L. Nelson
Affiliations: Old Dominion University, Norfolk, VA; Old Dominion University, Norfolk, VA
Venue: Proceedings of the seventeenth conference on Hypertext and hypermedia
Year: 2006

Citing 17

Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Finding replicated Web collections
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data

The open archives initiative: building a low-barrier interoperability framework
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries

Managing change on the web
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries

Web page change and persistence---a four-year longitudinal study
Journal of the American Society for Information Science and Technology

The decay and failures of web references
Communications of the ACM

Persistence of Web References in Scientific Research
Computer

Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases

Notes from the Interoperability Front: A Progress Report on the Open Archives Initiative
ECDL '02 Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries

The OAI-PMH static repository and static repository gateway
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Robust Hyperlinks Cost Just Five Words Each

Refinement of TF-IDF schemes for web pages using their hyperlinked neighboring pages
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia

Managing distributed collections: evaluating web page changes, movement, and replacement
Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries

Analysis of lexical signatures for improving information persistence on the World Wide Web
ACM Transactions on Information Systems (TOIS)

The LOCKSS peer-to-peer digital preservation system
ACM Transactions on Computer Systems (TOCS)

Shuffling a stacked deck: the case for partially randomized ranking of search engine results
VLDB '05 Proceedings of the 31st international conference on Very large data bases

Evaluation of crawling policies for a web-repository crawler
Proceedings of the seventeenth conference on Hypertext and hypermedia

Cited 14

Evaluation of crawling policies for a web-repository crawler
Proceedings of the seventeenth conference on Hypertext and hypermedia

Factors affecting website reconstruction from the web infrastructure
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Usage analysis of a public website reconstruction tool
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries

Revisiting Lexical Signatures to (Re-)Discover Web Pages
ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries

A comparison of techniques for estimating IDF values to generate lexical signatures for the web
Proceedings of the 10th ACM workshop on Web information and data management

Finding what is missing from a digital library: A case study in the Computer Science field
Information Processing and Management: an International Journal

Correlation of Term Count and Document Frequency for Google N-Grams
ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval

Retrieving broken web links using an approach based on contextual information
Proceedings of the 20th ACM conference on Hypertext and hypermedia

PaMS: A component-based service for finding the missing full text of articles cataloged in a digital library
Information Systems

Evaluating methods to rediscover missing web pages from the web infrastructure
Proceedings of the 10th annual joint conference on Digital libraries

DSNotify - A solution for event detection and link maintenance in dynamic datasets
Web Semantics: Science, Services and Agents on the World Wide Web

Analyzing information retrieval methods to recover broken web links
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval

Updating broken web links: An automatic recommendation system
Information Processing and Management: an International Journal

Identifying "soft 404" error pages: analyzing the lexical signatures of documents in distributed collections
TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries

Abstract

We present Opal, a lightweight framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and can share their knowledge with other Opal servers by mutual harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures, which are then used to search for similar versions of the web page. We present the architecture of the Opal framework, discuss a reference implementation of the framework, and present a quantitative analysis of the framework that indicates that Opal could be effectively deployed.
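
The abstract does not say how Opal computes a lexical signature. As an illustration only, the sketch below assumes a simple TF-IDF ranking: the terms of a cached page are scored against a small background corpus and the top few are kept as a search query. The function names, the smoothing in the IDF formula, and the default of five terms are all assumptions made for this example, not details taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus is a list of background documents (plain strings) used only to
    estimate document frequency. Hypothetical helper, not Opal's own code.
    """
    term_counts = Counter(tokenize(page_text))
    doc_term_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_term_sets)

    scores = {}
    for term, count in term_counts.items():
        # Number of background documents that contain the term.
        df = sum(1 for terms in doc_term_sets if term in terms)
        # Smoothed IDF (an assumption): terms found in every document score 0.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = count * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


if __name__ == "__main__":
    background = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the opal framework finds missing pages",
    ]
    cached_copy = "the opal framework the opal server"
    # The resulting terms could be joined into a search-engine query string.
    print(" ".join(lexical_signature(cached_copy, background, k=3)))
```

Under this scheme, words that appear in every background document (such as "the" here) score zero and fall out of the signature, while rare, page-specific terms rise to the top.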