Real-time compression for primary storage is quickly becoming widespread as data volumes continue to grow exponentially, but adding compression on the data path consumes scarce CPU and memory resources on the storage system. Our work aims to mitigate this cost by introducing methods that quickly and accurately identify the data that will yield significant space savings when compressed. The first level of filtering operates at the data-set level (e.g., a volume or file system), where we estimate the overall compressibility of the data at rest. Based on the outcome, we may enable or disable compression for the entire data set, or apply a second, finer-grained level of filtering. This second filtering scheme examines data as it is written to the storage system and determines its compressibility online. The first-level filtering runs in mere minutes while providing mathematically proven guarantees on its estimates. In addition to aiding in the selection of which volumes to compress, it has been released as a public tool, allowing potential customers to determine the effectiveness of compression on their data and to aid in capacity planning. The second-level filtering has shown significant CPU savings (up to 35%) while preserving nearly all compression savings (within 2%).
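To illustrate the general idea behind data-set-level estimation, the sketch below estimates a compression ratio by compressing a random sample of fixed-size blocks rather than the entire data set. This is a minimal, hypothetical illustration of sampling-based estimation in general; it is not the paper's estimator, and the block size, sample count, and use of zlib are assumptions chosen for the example.

```python
import random
import zlib


def estimate_compression_ratio(data: bytes, block_size: int = 4096,
                               num_samples: int = 64, seed: int = 0) -> float:
    """Estimate compressed/original size ratio by compressing a random
    sample of fixed-size blocks.

    Illustrative sketch only: the real first-level filter described in the
    abstract provides proven accuracy guarantees, which this naive sampler
    does not.
    """
    rng = random.Random(seed)
    n_blocks = max(1, len(data) // block_size)
    samples = min(num_samples, n_blocks)
    total_in = 0
    total_out = 0
    for _ in range(samples):
        i = rng.randrange(n_blocks)
        block = data[i * block_size:(i + 1) * block_size]
        total_in += len(block)
        total_out += len(zlib.compress(block))
    return total_out / total_in


# Highly repetitive data should yield a ratio far below 1,
# suggesting compression is worthwhile for this data set.
repetitive = b"abcd" * 100_000
ratio = estimate_compression_ratio(repetitive)
```

A ratio near 1 (or above, due to compressor overhead) would instead suggest disabling compression for the data set, mirroring the enable/disable decision described above.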