With the explosive growth of data, storage systems face enormous pressure from redundant data arising from duplicate copies or duplicate regions of files. Data deduplication is a storage-optimization technique that reduces the data footprint by eliminating redundant copies and storing only unique data. It rests on duplicate data detection, which divides files into parts, compares corresponding parts across files via hashing, and identifies redundant data. This paper proposes SBBS, an efficient sliding blocking algorithm with backtracking sub-blocks for duplicate data detection. SBBS improves on the detection precision of the traditional sliding blocking (SB) algorithm by backtracking the left/right 1/4 and 1/2 sub-blocks in matching-failed segments. Experimental results show that SBBS improves duplicate detection precision by 6.5% on average over traditional SB and by 16.5% over the content-defined chunking (CDC) algorithm, while adding little extra storage overhead when files are divided into equal chunks of 8 KB.
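The scheme sketched in the abstract can be illustrated with a minimal example: a traditional sliding-block pass, plus a simplified SBBS-style backtracking step that re-checks the left/right 1/2 and 1/4 sub-blocks of segments where no full block matched. This is a sketch under assumptions, not the paper's implementation: SHA-1 stands in for whatever hash scheme the authors use, block sizes are tiny for demonstration (the paper uses 8 KB chunks), and the real algorithm's bookkeeping for overlapping sub-block matches is more careful.

```python
import hashlib

def block_hashes(data, size):
    """Hash every aligned `size`-byte chunk of `data`."""
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data) - size + 1, size)}

def sliding_block_duplicates(old, new, size):
    """Traditional SB: slide a window over `new` one byte at a time and
    count bytes whose window hash matches an aligned block of `old`."""
    index = block_hashes(old, size)
    dup = i = 0
    while i + size <= len(new):
        if hashlib.sha1(new[i:i + size]).hexdigest() in index:
            dup += size
            i += size          # jump past the matched block
        else:
            i += 1             # slide by one byte
    return dup

def sbbs_duplicates(old, new, size):
    """SBBS-style sketch: like SB, but also index 1/2- and 1/4-size blocks
    of `old` and re-check the ends of matching-failed segments."""
    idx = {s: block_hashes(old, s) for s in (size, size // 2, size // 4)}

    def backtrack(seg):
        # Try the left and right sub-blocks of a failed segment,
        # preferring the larger (1/2) sub-block on each side.
        found = 0
        for take in (lambda s: seg[:s], lambda s: seg[-s:]):
            for s in (size // 2, size // 4):
                if len(seg) >= s and hashlib.sha1(take(s)).hexdigest() in idx[s]:
                    found += s
                    break
        return found

    dup = i = seg_start = 0
    while i + size <= len(new):
        if hashlib.sha1(new[i:i + size]).hexdigest() in idx[size]:
            dup += size + backtrack(new[seg_start:i])
            i += size
            seg_start = i
        else:
            i += 1
    return dup + backtrack(new[seg_start:])
```

For example, if a new file contains only the first half-block of an old block followed by fresh data, the plain SB pass finds no duplicate at all, while the backtracking pass still recovers the matching half-block, which is the precision gain the abstract reports.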