Block Size Optimization in Deduplication Systems

  • Authors:
  • Cornel Constantinescu; Jan Pieper; Tiancheng Li


  • Venue:
  • DCC '09 Proceedings of the 2009 Data Compression Conference
  • Year:
  • 2009


Abstract

Data deduplication is a popular dictionary-based compression method in storage archival and backup. Deduplication efficiency ("chunk" matching) improves with smaller chunk sizes, but the files become highly fragmented, requiring many disk accesses during reconstruction, or "chattiness" in a client-server architecture. Within the sequence of chunks into which an object (file) is decomposed, sub-sequences of adjacent chunks tend to repeat. We exploit this insight to optimize the chunk sizes by joining repeated sub-sequences of small chunks into new "super chunks," under the constraint of achieving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence. With super chunks we significantly reduce fragmentation, improving reconstruction time and the overall deduplication ratio by lowering the amount of metadata (fewer hashes and dictionary entries).
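
To make the suffix-array idea concrete, here is a minimal Python sketch. It is an illustrative toy under stated assumptions, not the paper's actual encoding algorithm: chunks are modeled as integer IDs, the suffix array is built by naive comparison sorting, and the folding policy (greedily replacing the longest repeated run with a fresh super-chunk ID) is a simplification. All function names are hypothetical.

```python
# Sketch: find repeated sub-sequences of adjacent chunk IDs with a suffix
# array + LCP array, then fold each occurrence into a "super chunk" ID.
# Assumptions: chunk IDs are non-negative ints; naive O(n^2 log n) sort.

from typing import List, Tuple

def suffix_array(seq: List[int]) -> List[int]:
    """Indices of seq's suffixes in lexicographic order (naive construction)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def lcp_array(seq: List[int], sa: List[int]) -> List[int]:
    """lcp[i] = common-prefix length of the suffixes at sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        k = 0
        while a + k < len(seq) and b + k < len(seq) and seq[a + k] == seq[b + k]:
            k += 1
        lcp[i] = k
    return lcp

def longest_repeat(seq: List[int]) -> Tuple[int, int]:
    """(start, length) of the longest sub-sequence occurring at least twice."""
    sa = suffix_array(seq)
    lcp = lcp_array(seq, sa)
    i = max(range(len(lcp)), key=lambda j: lcp[j], default=0)
    return sa[i], lcp[i]

def fold_super_chunks(seq: List[int], min_len: int = 2) -> List[int]:
    """Greedily replace the longest repeated run of adjacent chunk IDs with a
    fresh super-chunk ID until no repeat of length >= min_len remains."""
    next_id = max(seq) + 1
    while True:
        start, length = longest_repeat(seq)
        if length < min_len:
            return seq
        pattern = seq[start:start + length]
        out, i = [], 0
        while i < len(seq):
            if seq[i:i + length] == pattern:
                out.append(next_id)   # emit the super chunk in place of the run
                i += length
            else:
                out.append(seq[i])
                i += 1
        seq, next_id = out, next_id + 1

# Example: the adjacent run 3,4,5 repeats and is folded into super chunk 7.
print(fold_super_chunks([1, 3, 4, 5, 2, 3, 4, 5, 6]))  # -> [1, 7, 2, 7, 6]
```

A real system would presumably use a linear-time suffix-array construction (e.g., SA-IS) over content-defined chunk hashes, and would choose the covering encoding subject to the matching-performance constraint the abstract mentions rather than folding greedily.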