Characteristics of backup workloads in production systems

  • Authors:
  • Grant Wallace; Fred Douglis; Hangwei Qian; Philip Shilane; Stephen Smaldone; Mark Chamness; Windsor Hsu

  • Affiliations:
  • Backup Recovery Systems Division, EMC Corporation; Backup Recovery Systems Division, EMC Corporation; Case Western Reserve University and Backup Recovery Systems Division, EMC Corporation; Backup Recovery Systems Division, EMC Corporation; Backup Recovery Systems Division, EMC Corporation; Backup Recovery Systems Division, EMC Corporation; Backup Recovery Systems Division, EMC Corporation

  • Venue:
  • FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
  • Year:
  • 2012

Abstract

Data-protection class workloads, including backup and long-term retention of data, have seen a strong industry shift from tape-based platforms to disk-based systems. But the latter are traditionally designed to serve as primary storage, and there has been little published analysis of the characteristics of backup workloads as they relate to the design of disk-based systems. In this paper, we present a comprehensive characterization of backup workloads by analyzing statistics and content metadata collected from a large set of EMC Data Domain backup systems in production use. This analysis is both broad (encompassing statistics from over 10,000 systems) and deep (using detailed metadata traces from several production systems storing almost 700TB of backup data). We compare these systems to a detailed study of Microsoft primary storage systems [22], showing that backup storage differs significantly from their primary storage workload in the amount of data churn and capacity requirements as well as the amount of redundancy within the data. These properties bring unique challenges and opportunities when designing a disk-based filesystem for backup workloads, which we explore in more detail using the metadata traces. In particular, we consider the need to handle high churn while leveraging high data redundancy by looking at deduplication unit size and caching efficiency.