In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to poll the sources periodically to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date.

This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages, and we also show that our proposed refresh policies significantly improve the "freshness" of data. In certain cases, we obtained orders-of-magnitude improvements over existing policies.
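
To make the Poisson change model concrete, the sketch below estimates how fresh a single copy stays when its source changes as a Poisson process and the copy is refreshed at a fixed interval. The abstract does not spell out its metric definitions, so this assumes one common reading: freshness is the long-run fraction of time the copy matches the source. Under that assumption the closed form is (1 - e^(-λI)) / (λI) for change rate λ and refresh interval I. The function names and parameters are illustrative, not taken from the article.

```python
import math
import random


def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Closed-form time-averaged freshness of one copy.

    Assumes the source changes as a Poisson process with rate `change_rate`
    and the copy is refreshed every `refresh_interval` time units.
    """
    x = change_rate * refresh_interval
    if x == 0:
        return 1.0  # a source that never changes is always fresh
    return (1.0 - math.exp(-x)) / x


def simulated_freshness(change_rate: float, refresh_interval: float,
                        cycles: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (requires change_rate > 0)."""
    rng = random.Random(seed)
    fresh_time = 0.0
    for _ in range(cycles):
        # After a refresh the copy stays fresh until the first source change,
        # whose waiting time is exponential under the Poisson assumption.
        first_change = rng.expovariate(change_rate)
        fresh_time += min(first_change, refresh_interval)
    return fresh_time / (cycles * refresh_interval)


if __name__ == "__main__":
    for rate in (0.1, 1.0, 10.0):
        analytic = expected_freshness(rate, 1.0)
        estimate = simulated_freshness(rate, 1.0)
        print(f"rate={rate:5.1f}  analytic={analytic:.4f}  simulated={estimate:.4f}")
```

The closed form shows why polling budgets matter: freshness falls as the product of change rate and refresh interval grows, so a page that changes far more often than it is polled contributes little freshness per poll.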