Characteristics of backup workloads in production systems

Authors:
Grant Wallace;Fred Douglis;Hangwei Qian;Philip Shilane;Stephen Smaldone;Mark Chamness;Windsor Hsu
Affiliations:
Backup Recovery Systems Division, EMC Corporation;Backup Recovery Systems Division, EMC Corporation;Case Western Reserve University and Backup Recovery Systems Division, EMC Corporation;Backup Recovery Systems Division, EMC Corporation;Backup Recovery Systems Division, EMC Corporation;Backup Recovery Systems Division, EMC Corporation;Backup Recovery Systems Division, EMC Corporation
Venue:
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Year:
2012

Citing 28
Cited 21

Data compression

ACM Computing Surveys (CSUR)
Characteristics of files in NFS environments

SIGSMALL '91 Proceedings of the 1991 ACM SIGSMALL/PC symposium on Small systems
Measurements of a distributed file system

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
A large-scale study of file-system contents

SIGMETRICS '99 Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A trace-driven analysis of the UNIX 4.2 BSD file system

Proceedings of the tenth ACM symposium on Operating systems principles
A low-bandwidth network file system

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
A study of file sizes and functional lifetimes

SOSP '81 Proceedings of the eighth ACM symposium on Operating systems principles
Characteristics of I/O traffic in personal computer and server workloads

IBM Systems Journal
Improving duplicate elimination in storage systems

ACM Transactions on Storage (TOS)
Redundancy elimination within large collections of files

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Alternatives for detecting redundancy in storage systems data

ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
Single instance storage in Windows® 2000

WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
A comparison of file system workloads

ATEC '00 Proceedings of the annual conference on USENIX Annual Technical Conference
A five-year study of file-system metadata

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Avoiding the disk bottleneck in the data domain deduplication file system

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Measurement and analysis of large-scale network file system workloads

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Sparse indexing: large scale, inline deduplication using sampling and locality

FAST '09 Proceedings of the 7th conference on File and storage technologies
HYDRAstor: a Scalable Secondary Storage

FAST '09 Proceedings of the 7th conference on File and storage technologies
Bimodal content defined chunking for backup streams

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Characterizing datasets for data deduplication in backup applications

IISWC '10 Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'10)
A study of practical deduplication

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Tradeoffs in scalable data routing for deduplication clusters

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Venti: a new approach to archival storage

FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies
Building a high-performance deduplication system

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Chunk Fragmentation Level: An Effective Indicator for Read Performance Degradation in Deduplication Storage

HPCC '11 Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications
WAN optimized replication of backup datasets using stream-informed delta compression

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
SFS: random write considered harmful in solid state drives

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Capacity forecasting in a backup storage environment

LISA'11 Proceedings of the 25th international conference on Large Installation System Administration

WAN optimized replication of backup datasets using stream-informed delta compression

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Power consumption in enterprise-scale backup storage systems

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Generating realistic datasets for deduplication analysis

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Insights for data reduction in primary storage: a practical analysis

Proceedings of the 5th Annual International Systems and Storage Conference
WAN-optimized replication of backup datasets using stream-informed delta compression

ACM Transactions on Storage (TOS)
A study on data deduplication in HPC storage systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Probabilistic deduplication for cluster-based storage systems

Proceedings of the Third ACM Symposium on Cloud Computing
Reducing Storage Overhead with Small Write Bottleneck Avoiding in Cloud RAID System

GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
A scalable inline cluster deduplication framework for big data protection

Proceedings of the 13th International Middleware Conference
Block locality caching for data deduplication

Proceedings of the 6th International Systems and Storage Conference
Building intelligence for software defined data centers: modeling usage patterns

Proceedings of the 6th International Systems and Storage Conference
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
ROOT: replaying multithreaded traces with resource-oriented ordering

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
DupLESS: server-aided encryption for deduplicated storage

SEC'13 Proceedings of the 22nd USENIX conference on Security
Efficiently storing virtual machine backups

HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems
Characterization of incremental data changes for efficient data protection

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Memory efficient sanitization of a deduplicated storage system

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
File recipe compression in data deduplication systems

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Improving restore speed for backup systems that use inline chunk-based deduplication

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
(Big)data in a virtualized world: volume, velocity, and variety in cloud datacenters

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
Migratory compression: coarse-grained data reordering to improve compressibility

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies

Abstract

Data-protection class workloads, including backup and long-term retention of data, have seen a strong industry shift from tape-based platforms to disk-based systems. However, disk-based systems have traditionally been designed to serve as primary storage, and there has been little published analysis of the characteristics of backup workloads as they relate to the design of disk-based systems. In this paper, we present a comprehensive characterization of backup workloads by analyzing statistics and content metadata collected from a large set of EMC Data Domain backup systems in production use. This analysis is both broad (encompassing statistics from over 10,000 systems) and deep (using detailed metadata traces from several production systems storing almost 700TB of backup data). We compare these systems to a detailed study of Microsoft primary storage systems [22], showing that backup storage differs significantly from the primary storage workload in the amount of data churn and capacity requirements as well as the amount of redundancy within the data. These properties bring unique challenges and opportunities when designing a disk-based filesystem for backup workloads, which we explore in more detail using the metadata traces. In particular, we examine the need to handle high churn while leveraging high data redundancy by looking at deduplication unit size and caching efficiency.