Content-aware load balancing for distributed backup

  • Authors:
  • Fred Douglis; Deepti Bhardwaj; Hangwei Qian; Philip Shilane

  • Affiliations:
  • EMC; EMC; Case Western Reserve Univ.; EMC

  • Venue:
  • LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
  • Year:
  • 2011

Abstract

When backing up a large number of computer systems to many different storage devices, an administrator has to balance the workload to ensure that all backups complete successfully within a particular period of time. When these devices were magnetic tapes, this assignment was trivial: find an idle tape drive, write what fits on a tape, and replace tapes as needed. Backing up data onto deduplicating disk storage adds both complexity and opportunity. Since one cannot swap out a filled disk-based file system the way one switches tapes, each separate backup appliance needs an appropriate workload that fits within both the available storage capacity and the throughput available during the backup window. Repeating a given client's backups on the same appliance not only reduces capacity requirements but can also improve performance by eliminating duplicates from network traffic. Conversely, any reconfiguration of the mappings of backup clients to appliances suffers the overhead of repopulating the new appliance with a full copy of a client's data. Reassigning clients to new servers should be done only when the need for load balancing exceeds the overhead of the move. In addition, deduplication offers the opportunity for content-aware load balancing, which groups clients together to improve deduplication and can thereby further improve both capacity and performance; we have seen a system with as much as 75% of its data overlapping other systems, though overlap around 10% is more common. We describe an approach for clustering backup clients based on content, assigning them to backup appliances, and adapting future configurations based on changing requirements while minimizing client migration. We define a cost function and compare several algorithms for minimizing this cost. This assignment tool resides in a tier between backup software such as EMC NetWorker and deduplicating storage systems such as EMC Data Domain.
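
To make the trade-offs in the abstract concrete, the following is a minimal Python sketch of a greedy, cost-based assignment of clients to appliances. It is an illustration only, not the paper's cost function or any of the algorithms it compares. Every name and number here is an assumption made for the example: the `Client` and `Appliance` classes, the `group` label standing in for a content cluster, the 10% `OVERLAP` figure (borrowed from the "more common" case mentioned above), the `MOVE_PENALTY` weight, and the choice of the larger of capacity and throughput utilization as the base cost.

```python
import math
from dataclasses import dataclass, field

OVERLAP = 0.10       # assumed fraction of data shared within a content group
MOVE_PENALTY = 0.25  # assumed cost of repopulating a different appliance


@dataclass
class Client:
    name: str
    size_gb: float       # data to protect
    rate_mbps: float     # throughput needed during the backup window
    group: str           # hypothetical content-cluster label
    current: str = ""    # appliance that already holds this client's data


@dataclass
class Appliance:
    name: str
    capacity_gb: float
    throughput_mbps: float
    used_gb: float = 0.0
    used_mbps: float = 0.0
    groups: set = field(default_factory=set)


def stored_size(client, appliance):
    """Space the client needs, discounted if similar content is already there."""
    if client.group in appliance.groups:
        return client.size_gb * (1.0 - OVERLAP)
    return client.size_gb


def placement_cost(client, appliance):
    """Illustrative cost: worst utilization after placement, plus a move penalty."""
    cap = (appliance.used_gb + stored_size(client, appliance)) / appliance.capacity_gb
    thr = (appliance.used_mbps + client.rate_mbps) / appliance.throughput_mbps
    if cap > 1.0 or thr > 1.0:
        return math.inf  # does not fit in capacity or in the backup window
    moved = bool(client.current) and client.current != appliance.name
    return max(cap, thr) + (MOVE_PENALTY if moved else 0.0)


def assign(clients, appliances):
    """Greedy heuristic: place the largest clients first, each at lowest cost."""
    mapping = {}
    for c in sorted(clients, key=lambda c: c.size_gb, reverse=True):
        best = min(appliances, key=lambda a: placement_cost(c, a))
        if math.isinf(placement_cost(c, best)):
            raise ValueError(f"no appliance can take {c.name}")
        best.used_gb += stored_size(c, best)  # computed before the group is added
        best.used_mbps += c.rate_mbps
        best.groups.add(c.group)
        mapping[c.name] = best.name
    return mapping


appliances = [Appliance("a1", 1000, 400), Appliance("a2", 1000, 400)]
clients = [
    Client("db1", 400, 100, "db", current="a1"),
    Client("web1", 400, 50, "web", current="a2"),
    Client("db2", 300, 100, "db"),  # new client, no existing placement
]
mapping = assign(clients, appliances)
print(mapping)
```

In this toy input the two existing clients stay where they are because moving either one incurs the penalty, and the new client `db2` lands next to `db1` because the assumed overlap makes that placement slightly cheaper than the otherwise equally loaded alternative. A real tool would need measured overlap between clients rather than a fixed per-group discount.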