Effective change detection using sampling

Authors:
Junghoo Cho;Alexandros Ntoulas
Affiliations:
UCLA Computer Science Department, Los Angeles, CA;UCLA Computer Science Department, Los Angeles, CA
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Citing 16
Cited 27

An algorithm for concurrency control and recovery in replicated distributed databases

ACM Transactions on Database Systems (TODS)
Data caching issues in an information retrieval system

ACM Transactions on Database Systems (TODS)
Providing high availability using lazy replication

ACM Transactions on Computer Systems (TOCS)
Bounded ignorance: a technique for increasing concurrency in a replicated system

ACM Transactions on Database Systems (TODS)
Supporting multiple view maintenance policies

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Synchronizing a database to improve freshness

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
An adaptive model for optimizing performance of an incremental web crawler

Proceedings of the 10th international conference on World Wide Web
Applying the golden rule of sampling for query estimation

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Best-effort cache synchronization with source cooperation

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Mercator: A scalable, extensible Web crawler

World Wide Web
A Query Sampling Method of Estimating Local Cost Parameters in a Multidatabase System

Proceedings of the Tenth International Conference on Data Engineering
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Sampling-Based Estimation of the Number of Distinct Values of an Attribute

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Gambling in a rigged casino: The adversarial multi-armed bandit problem

FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science

Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Performance and cost tradeoffs in Web search

ADC '04 Proceedings of the 15th Australasian database conference - Volume 27
Modeling and Managing Content Changes in Text Databases

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
The infocious web search engine: improving web searching through linguistic analysis

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Data quality inference

Proceedings of the 2nd international workshop on Information quality in information systems
Looking at both the present and the past to efficiently update replicas of web content

Proceedings of the 7th annual ACM international workshop on Web information and data management
Adaptive pull-based policies for wide area data delivery

ACM Transactions on Database Systems (TODS)
Temporal multi-page summarization

Web Intelligence and Agent Systems
Designing efficient sampling techniques to detect webpage updates

Proceedings of the 16th international conference on World Wide Web
Efficient Monitoring Algorithm for Fast News Alerts

IEEE Transactions on Knowledge and Data Engineering
Modeling and managing changes in text databases

ACM Transactions on Database Systems (TODS)
Designing clustering-based web crawling policies for search engine crawlers

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Topical web crawling using weighted anchor text and web page change detection techniques

WSEAS Transactions on Information Science and Applications
SHARC: framework for quality-conscious web archiving

Proceedings of the VLDB Endowment
Web Crawling

Foundations and Trends in Information Retrieval
Efficiently detecting webpage updates using samples

ICWE'07 Proceedings of the 7th international conference on Web engineering
Clustering-based incremental web crawling

ACM Transactions on Information Systems (TOIS)
The SHARC framework for data quality in Web archiving

The VLDB Journal — The International Journal on Very Large Data Bases
Best-effort refresh strategies for content-based RSS feed aggregation

WISE'10 Proceedings of the 11th international conference on Web information systems engineering
State transfer graph: an efficient tool for webview maintenance

WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
Temporal ranking of search engine results

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
A hybrid approach for refreshing web page repositories

DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
Adaptive change estimation in the context of online market monitoring

EUROCAST'11 Proceedings of the 13th international conference on Computer Aided Systems Theory - Volume Part I
Predicting content change on the web

Proceedings of the sixth ACM international conference on Web search and data mining
Timely crawling of high-quality ephemeral new content

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
CUVIM: extracting fresh information from social network

WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management
A Hybrid Approach for Web Change Detection

International Journal of Information Technology and Web Engineering

Abstract

In large-scale, data-intensive environments such as the World Wide Web or data warehouses, we often maintain local copies of remote data sources. Because network and computational resources are limited, however, it is often difficult to monitor the sources constantly for changes and to download the changed data items to the copies. In this scenario, our goal is to detect as many changes as possible with a fixed amount of download resources. In this paper we propose three sampling-based download policies that identify changed data items more effectively. In our sampling-based approach, we first sample a small number of data items from each data source and then download more data items from the sources with more changed samples. We analyze the effectiveness of the sampling-based policies and compare them to existing ones, including the state-of-the-art frequency-based policy in [8, 11]. Our experiments on synthetic and real-world data show the relative merits of the various policies and the great potential of our sampling-based policy. In certain cases, our sampling-based policy could download twice as many changed items as the best existing policy.
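
The two-phase idea in the abstract (sample a few items per source, then spend the remaining downloads where the samples changed most) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not one of the paper's three policies: the function name `sample_then_download`, the greedy ranking by changed-sample ratio, the unit cost per download, and the boolean "changed" flags are all hypothetical choices made for illustration.

```python
import random


def sample_then_download(sources, sample_size, budget, seed=0):
    """Simulate a greedy sampling-based download policy (illustrative only).

    sources: dict mapping a source name to a list of booleans, where True
             means the remote item has changed since the local copy was made.
    sample_size: number of items to sample from each source.
    budget: total number of downloads allowed (assumed: one unit per item).
    Returns the number of changed items that were downloaded.
    """
    rng = random.Random(seed)
    detected = 0
    remaining = {}  # source -> item indices not yet downloaded
    ratio = {}      # source -> fraction of sampled items that had changed

    # Phase 1: download a small random sample from every source.
    for name, changed in sources.items():
        k = min(sample_size, len(changed), budget)
        indices = list(range(len(changed)))
        rng.shuffle(indices)
        sampled, remaining[name] = indices[:k], indices[k:]
        hits = sum(changed[i] for i in sampled)
        detected += hits
        budget -= k
        ratio[name] = hits / k if k else 0.0

    # Phase 2: spend what is left on the sources whose samples changed most.
    for name in sorted(sources, key=lambda n: ratio[n], reverse=True):
        take = remaining[name][:budget]
        detected += sum(sources[name][i] for i in take)
        budget -= len(take)
        if budget == 0:
            break
    return detected


if __name__ == "__main__":
    demo = {"a": [True] * 10, "b": [False] * 10, "c": [True] * 10}
    print(sample_then_download(demo, sample_size=2, budget=14))
```

In the demo, six downloads go to sampling and the remaining eight are spent on a source whose samples all changed, so no further downloads are spent on the unchanged source `b`. A policy that allocates downloads in proportion to the sampled ratio, rather than greedily, would be an equally reasonable variant of the same idea.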