There has been increasing interest in deploying data reduction techniques in primary storage systems. This paper analyzes large datasets in four typical enterprise data environments to find patterns that suggest good design choices for such systems. The overall data reduction opportunity is evaluated for deduplication and compression, separately and combined; an in-depth analysis then focuses on frequency, clustering, and other patterns in the collected data. The results suggest ways to improve performance and reduce resource requirements and system cost while maintaining data reduction effectiveness. These techniques include deciding which files to compress based on file type and size, using duplication affinity to guide deployment decisions, and adaptively optimizing the detection and mapping of duplicate content when large segments account for most of the reduction opportunity.
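To make the first of these techniques concrete, below is a minimal Python sketch of a type- and size-gated compression policy. The extension list, size threshold, and function names are illustrative assumptions for this sketch, not values or APIs taken from the paper.

```python
import os
import zlib

# Illustrative policy parameters; a real system would derive these from
# measured per-type compressibility rather than a fixed hand-picked list.
ALREADY_COMPRESSED = {".jpg", ".png", ".mp4", ".zip", ".gz"}
MIN_SIZE_BYTES = 4096  # below this, per-file overhead tends to outweigh savings

def should_compress(path: str, size: int) -> bool:
    """Gate compression on file type and size, as the abstract suggests."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ALREADY_COMPRESSED:
        return False  # already-compressed formats rarely shrink further
    return size >= MIN_SIZE_BYTES

def store(path: str, data: bytes) -> bytes:
    """Compress only when the policy predicts a worthwhile gain."""
    return zlib.compress(data) if should_compress(path, len(data)) else data
```

The point of such a gate is that skipping small files and already-compressed formats avoids spending CPU on data that would yield little reduction, which is the kind of resource saving the abstract describes.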