Approximate replication

Authors:
Jennifer Widom;Christopher Alden Remi Olston
Affiliations:
-;-
Venue:
Approximate replication
Year:
2003

Citing 0
Cited 10

Fault-tolerance in the Borealis distributed stream processing system
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Asking the right questions: model-driven optimization using probes
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Model-driven optimization using adaptive probes
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms

Value-based notification conditions in large-scale publish/subscribe systems
VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Fault-tolerance in the Borealis distributed stream processing system
ACM Transactions on Database Systems (TODS)

Optimising context data dissemination and storage in distributed pervasive computing systems
Pervasive and Mobile Computing

ATMosphere: a system for atm microdeposit services in rural contexts
ICTD'09 Proceedings of the 3rd international conference on Information and communication technologies and development

How to probe for an extreme value
ACM Transactions on Algorithms (TALG)

Adaptive Uncertainty Resolution in Bayesian Combinatorial Optimization Problems
ACM Transactions on Algorithms (TALG)

Distributed network querying with bounded approximate caching
DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications

Abstract

In environments that collect or monitor data, useful data may be spread across multiple distributed nodes, while users or applications wish to access that data from a central location. A common way to provide centralized access to distributed data is to maintain replicas of the data objects of interest at the central location. When data collections are large or volatile, keeping replicas consistent with remote master copies is a significant challenge because of the large communication cost incurred. Consequently, many real-world environments do not maintain exact replica consistency and typically use some form of inexact, or approximate, replication instead.
Approximate replication is often performed by refreshing replicas periodically. Periodic refreshing allows communication cost to be controlled, but it does not always make good use of communication resources: between refreshes, some remote master copies may change significantly, leaving their replicas excessively out of date and inaccurate, while resources may be wasted refreshing replicas of other master copies that remain nearly unchanged. This dissertation studies how to make better use of communication resources in data replication environments than approaches based on periodic refreshing do.
The analysis of approximate replication environments is framed in terms of a two-dimensional space whose axes denote system performance (a measure of communication resource utilization) and replica precision (a measure of the degree of synchronization with remote master copies). There is a fundamental and unavoidable tradeoff between precision and performance: when data changes rapidly, good performance can be achieved only by sacrificing replica precision and, conversely, obtaining high precision tends to degrade performance.
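The weakness of periodic refreshing described above can be illustrated with a toy simulation. This is only a sketch under assumed conditions, not code from the dissertation: the function name `periodic_refresh`, the two synthetic series, and the use of absolute difference as the precision measure are all hypothetical choices made for illustration.

```python
def periodic_refresh(series, period):
    """Refresh the replica every `period` steps.

    Returns (messages_sent, max_error), where error is the absolute
    difference between the master value and the replica (an assumed
    precision measure, chosen only for this illustration).
    """
    replica = series[0]
    messages = 0
    max_error = 0.0
    for t, value in enumerate(series):
        if t % period == 0:
            replica = value  # refresh: copy the master value to the replica
            messages += 1
        max_error = max(max_error, abs(value - replica))
    return messages, max_error


# Hypothetical master copies: one changes by 10 every step, one never changes.
volatile = [float(10 * t) for t in range(12)]
stable = [100.0] * 12

print(periodic_refresh(volatile, 4))  # (3, 30.0): replica drifts far between refreshes
print(periodic_refresh(stable, 4))    # (3, 0.0): three messages spent on an unchanged value
```

Both objects cost the same three messages, yet the volatile replica is off by up to 30 while the stable replica gains nothing from its refreshes, which is the misallocation the abstract points out.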
Two natural and complementary methods for working with the precision-performance tradeoff are proposed to achieve efficient communication resource utilization for replica synchronization: (1) Maximize replica precision in the presence of constraints on communication cost. (2) Minimize communication cost in the presence of constraints on replica precision. Problem definition, analysis, algorithms, and implementation techniques are developed for each method in turn, with the overall goal of creating a comprehensive framework for resource-efficient approximate replication. The effectiveness of each technique is verified using simulations over both synthetic and real-world data. In addition, a test-bed network traffic monitoring system is described, which uses some of the approximate replication techniques developed in this dissertation to track usage patterns and flag potential security hazards.
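The two formulations above can be sketched in a few lines each. The abstract does not give the dissertation's actual algorithms, so the code below swaps in two simple stand-ins: a fixed divergence threshold for method (2) and a greedy "most divergent first" choice for method (1). The names `refresh_within_bound` and `refresh_with_budget`, the parameters `delta` and `budget`, and the absolute-difference precision measure are all assumptions made for illustration.

```python
def refresh_within_bound(series, delta):
    """Method (2) flavour: keep |master - replica| <= delta with few messages.

    The source sends an update only when the master value has drifted more
    than `delta` from the value the replica currently holds.
    Returns (messages_sent, max_error_after_each_step).
    """
    replica = series[0]
    messages = 1  # initial copy of the master value
    max_error = 0.0
    for value in series[1:]:
        if abs(value - replica) > delta:
            replica = value
            messages += 1
        max_error = max(max_error, abs(value - replica))
    return messages, max_error


def refresh_with_budget(masters, replicas, budget):
    """Method (1) flavour: at most `budget` refreshes, spent on the replicas
    that currently diverge most from their masters (a greedy heuristic).

    Updates `replicas` in place and returns the list of refreshed keys.
    """
    by_divergence = sorted(
        masters, key=lambda k: abs(masters[k] - replicas[k]), reverse=True
    )
    refreshed = by_divergence[:budget]
    for key in refreshed:
        replicas[key] = masters[key]
    return refreshed


volatile = [float(10 * t) for t in range(12)]
stable = [100.0] * 12
print(refresh_within_bound(volatile, 25.0))  # (4, 20.0)
print(refresh_within_bound(stable, 25.0))    # (1, 0.0)

masters = {"a": 5.0, "b": 50.0, "c": 7.0}
replicas = {"a": 4.0, "b": 10.0, "c": 0.0}
print(refresh_with_budget(masters, replicas, 1))  # ['b']
```

In this toy setting the threshold policy spends messages only on the object that actually changes, and the budgeted policy directs its single refresh to the replica with the largest divergence. A real system would also need to decide where divergence is measured (at the source or centrally), which this sketch does not address.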