Effective page refresh policies for Web crawlers

Authors:
Junghoo Cho; Hector Garcia-Molina
Affiliations:
University of California, Los Angeles, California; Stanford University, Stanford, California
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2003

Abstract

In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date.

This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages, and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we obtained orders-of-magnitude improvement over existing policies.
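The abstract does not spell out its metrics, so the following is only an illustrative sketch under stated assumptions, not the article's own formulation. It assumes a single page whose changes follow a Poisson process with rate `change_rate`, a crawler that re-fetches the page at a fixed `refresh_interval`, and a freshness measure defined as the fraction of time the local copy matches the source. The function names and parameters are hypothetical. Under those assumptions, the closed-form value can be checked against a small simulation:

```python
import math
import random


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Time-averaged probability that the local copy matches the source.

    Assumes Poisson changes at `change_rate` and a refresh every
    `refresh_interval` time units (both assumptions of this sketch).
    """
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return (1.0 - math.exp(-x)) / x


def simulated_freshness(change_rate: float, refresh_interval: float,
                        cycles: int = 100_000, seed: int = 0) -> float:
    """Estimate the same quantity by simulating refresh cycles."""
    rng = random.Random(seed)
    fresh_time = 0.0
    for _ in range(cycles):
        # After each refresh the copy stays fresh until the first change,
        # whose waiting time is exponential under the Poisson assumption.
        first_change = rng.expovariate(change_rate)
        fresh_time += min(first_change, refresh_interval)
    return fresh_time / (cycles * refresh_interval)


if __name__ == "__main__":
    rate, interval = 1.0, 1.0  # hypothetical values, e.g. one change per day
    print(expected_freshness(rate, interval))
    print(simulated_freshness(rate, interval))
```

In this sketch, shortening the refresh interval for a fixed change rate raises the expected freshness, which is the trade-off between polling cost and freshness that a refresh policy has to balance.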