Approximate replication

  • Authors:
  • Jennifer Widom; Christopher Alden Remi Olston

  • Venue:
  • Approximate replication
  • Year:
  • 2003

Abstract

In distributed environments that collect or monitor data, useful data may be spread across multiple nodes, but users or applications may wish to access that data from a central location. A common way to facilitate centralized access to distributed data is to maintain replicas of data objects of interest at a central location. When data collections are large or volatile, keeping replicas consistent with remote master copies poses a significant challenge due to the large communication cost incurred. Consequently, in many real-world environments exact replica consistency is not maintained, and some form of inexact, or approximate, replication is typically used instead. Approximate replication is often performed by refreshing replicas periodically. Periodic refreshing allows communication cost to be controlled, but it does not always make good use of communication resources: between refreshes, some remote master copies may change significantly, leaving replicas excessively out of date and inaccurate, while resources may be wasted refreshing replicas of other master copies that remain nearly unchanged.
This dissertation studies how to make better use of communication resources in data replication environments than approaches based on periodic refreshing do. Analysis of approximate replication environments is framed in terms of a two-dimensional space with axes denoting system performance (a measure of communication resource utilization) and replica precision (a measure of the degree of synchronization with remote master copies). There is a fundamental and unavoidable tradeoff between precision and performance: when data changes rapidly, good performance can only be achieved by sacrificing replica precision and, conversely, obtaining high precision tends to degrade performance.
Two natural and complementary methods for working with the precision-performance tradeoff are proposed to achieve efficient communication resource utilization for replica synchronization: (1) Maximize replica precision in the presence of constraints on communication cost. (2) Minimize communication cost in the presence of constraints on replica precision. Problem definition, analysis, algorithms, and implementation techniques are developed for each method in turn, with the overall goal of creating a comprehensive framework for resource-efficient approximate replication. The effectiveness of each technique is verified using simulations over both synthetic and real-world data. In addition, a test-bed network traffic monitoring system is described, which uses some of the approximate replication techniques developed in this dissertation to track usage patterns and flag potential security hazards.
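
To make the contrast with periodic refreshing concrete, the following is a minimal illustrative sketch, not the dissertation's algorithms. It assumes a single numeric master value, counts one message per refresh, and measures imprecision as the absolute difference between master and replica. The function names `periodic_refresh` and `bounded_refresh`, the `bound` parameter, and the sample trace are all hypothetical.

```python
from typing import List, Tuple


def periodic_refresh(values: List[float], period: int) -> Tuple[int, float]:
    """Refresh the replica every `period` steps.

    Returns (messages sent, worst replica error observed).
    """
    replica = values[0]
    messages = 0
    max_error = 0.0
    for step, master in enumerate(values):
        if step % period == 0:
            replica = master  # refresh regardless of how much changed
            messages += 1
        max_error = max(max_error, abs(master - replica))
    return messages, max_error


def bounded_refresh(values: List[float], bound: float) -> Tuple[int, float]:
    """Refresh only when the master drifts more than `bound` from the replica.

    This is an assumed, simplified stand-in for a precision constraint:
    the replica error never exceeds `bound` after each step.
    """
    replica = values[0]
    messages = 1  # the initial copy counts as one message
    max_error = 0.0
    for master in values[1:]:
        if abs(master - replica) > bound:
            replica = master  # refresh only when the constraint is violated
            messages += 1
        max_error = max(max_error, abs(master - replica))
    return messages, max_error


if __name__ == "__main__":
    # Hypothetical trace: long quiet stretches with two sudden jumps.
    trace = [10.0, 10.0, 11.0, 10.0, 15.0, 15.0, 20.0, 20.0, 21.0, 20.0]
    print("periodic:", periodic_refresh(trace, period=3))
    print("bounded: ", bounded_refresh(trace, bound=1.0))
```

On this made-up trace, refreshing every 3 steps sends 4 messages and still lets the replica drift by 5, whereas the bound-driven policy sends 3 messages and keeps the error at or below 1. The example only illustrates why refreshing on a fixed schedule can spend messages on unchanged data while missing large changes; real workloads may behave differently.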