There has been increasing interest in deploying data reduction techniques in primary storage systems. This paper analyzes large datasets in four typical enterprise data environments to find patterns that suggest good design choices for such systems. The overall data reduction opportunity is evaluated for deduplication and compression, separately and combined; an in-depth analysis then focuses on frequency, clustering, and other patterns in the collected data. The results suggest ways to improve performance and reduce resource requirements and system cost while maintaining data reduction effectiveness. These techniques include deciding which files to compress based on file type and size, using duplication affinity to guide deployment decisions, and adaptively optimizing the detection and mapping of duplicate content when large segments account for most of the reduction opportunity.
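To make the first of these techniques concrete, below is a minimal Python sketch of a type- and size-gated compression policy. The extension list, size threshold, and function names are illustrative assumptions for this sketch, not values or APIs taken from the paper.

```python
import os
import zlib

# Illustrative policy parameters; a real system would derive these from
# measured per-type compressibility rather than a fixed hand-picked list.
ALREADY_COMPRESSED = {".jpg", ".png", ".mp4", ".zip", ".gz"}
MIN_SIZE_BYTES = 4096  # below this, per-file overhead tends to outweigh savings

def should_compress(path: str, size: int) -> bool:
    """Gate compression on file type and size, as the abstract suggests."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ALREADY_COMPRESSED:
        return False  # already-compressed formats rarely shrink further
    return size >= MIN_SIZE_BYTES

def store(path: str, data: bytes) -> bytes:
    """Compress only when the policy predicts a worthwhile gain."""
    return zlib.compress(data) if should_compress(path, len(data)) else data
```

The point of such a gate is that skipping small files and already-compressed formats avoids spending CPU on data that would yield little reduction, which is the kind of resource saving the abstract describes.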