Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks

  • Authors:
  • Torsten Suel, Patrick Noel, Dimitre Trendafilov

  • Venue:
  • ICDE '04 Proceedings of the 20th International Conference on Data Engineering
  • Year:
  • 2004


Abstract

We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and web caching networks, web site mirroring, storage networks, and large-scale web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost, by exploiting the significant similarity between the versions? While a popular open source tool for this problem called rsync is used in hundreds of thousands of installations, there have been only a few attempts to improve upon this tool in practice.

In this paper, we propose a framework for remote file synchronization and describe several new techniques that result in significant bandwidth savings. Our focus is on applications where very large collections have to be maintained over slow connections. We show that a prototype implementation of our framework and techniques achieves significant improvements over rsync. As an example application, we focus on the efficient synchronization of very large web page collections for the purpose of search, mining, and content distribution.
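To make the file synchronization problem concrete, the following is a minimal sketch of the rsync-style block-matching idea the abstract refers to: the receiver sends per-block hashes of its outdated file, the sender scans the current file for matching windows and transmits only block references plus literal bytes, and the receiver reassembles the current file locally. This is a simplified illustration, not the paper's method: it uses fixed-size blocks, a tiny block size, and MD5 at every offset, whereas real rsync combines a cheap rolling weak checksum with a strong checksum to make the scan efficient.

```python
import hashlib

BLOCK = 8  # tiny block size for demonstration; real rsync uses ~700 bytes

def signatures(old: bytes) -> dict:
    """Receiver side: hash each fixed-size block of the outdated file."""
    return {hashlib.md5(old[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(old), BLOCK)}

def make_delta(new: bytes, sigs: dict) -> list:
    """Sender side: scan the current file; emit a block reference wherever a
    window matches a receiver block, and literal bytes everywhere else."""
    delta, lit, i = [], bytearray(), 0
    while i < len(new):
        h = hashlib.md5(new[i:i + BLOCK]).hexdigest()
        if len(new) - i >= BLOCK and h in sigs:
            if lit:
                delta.append(("lit", bytes(lit)))
                lit = bytearray()
            delta.append(("ref", sigs[h]))  # receiver already has this block
            i += BLOCK
        else:
            lit.append(new[i])  # no match at this offset; ship the byte
            i += 1
    if lit:
        delta.append(("lit", bytes(lit)))
    return delta

def apply_delta(old: bytes, delta: list) -> bytes:
    """Receiver side: rebuild the current file from local blocks + literals."""
    out = bytearray()
    for kind, val in delta:
        out += old[val:val + BLOCK] if kind == "ref" else val
    return bytes(out)

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown fox leaps over the lazy dog!"
delta = make_delta(new, signatures(old))
assert apply_delta(old, delta) == new
```

The communication cost is the signatures in one direction plus the delta in the other, so when the versions are similar the delta is dominated by short block references rather than file content; the paper's techniques aim to shrink these costs further than rsync does.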