Bringing your dead links back to life: a comprehensive approach and lessons learned

Authors:
Atsuyuki Morishima;Akiyoshi Nakamizo;Toshinari Iida;Shigeo Sugimoto;Hiroyuki Kitagawa
Affiliations:
University of Tsukuba, Tsukuba, Japan;Shibaura Institute of Technology, Tokyo, Japan;University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan
Venue:
Proceedings of the 20th ACM conference on Hypertext and hypermedia
Year:
2009


Abstract

This paper presents an experimental study of the automatic correction of broken (dead) Web links, focusing in particular on links broken by the relocation of Web pages. Our first contribution is an algorithm that incorporates a comprehensive set of heuristics, some of which are novel, in a single unified framework. Our second contribution is a relatively large-scale experiment whose analysis revealed the characteristics of the problem of finding moved Web pages. We demonstrated empirically that searching for moved pages differs from typical information retrieval problems in two ways. First, the final destination cannot be identified until the page has moved, so the index-server approach is not necessarily effective. Second, the likely locations of the new address are strongly biased, so crawler-based solutions can be implemented effectively without searching the entire Web. We analyzed the experimental results in detail to show how important each heuristic is in real Web settings, and conducted statistical analyses showing that our algorithm correctly finds new links for more than 70% of broken links at the 95% confidence level.
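
The closing statistical claim can be read as a lower confidence bound on a success proportion. The sketch below is illustrative only: the abstract does not say which statistical procedure the authors used, so the one-sided Wilson score bound is an assumption here, and the function name `wilson_lower_bound` and the sample counts are hypothetical, not data from the paper.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """One-sided Wilson score lower bound for a binomial proportion.

    z = 1.645 is the standard normal quantile for a one-sided
    95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / denom


# Hypothetical counts for illustration only (NOT figures from the paper):
# 160 of 200 sampled broken links were corrected.
fixed, sampled = 160, 200
lower = wilson_lower_bound(fixed, sampled)
print(f"lower bound = {lower:.3f}; exceeds 0.70: {lower > 0.70}")
```

With counts like these, a claim of "more than 70% at the 95% confidence level" holds when the lower bound, rather than the raw success rate, stays above 0.70.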