With the explosive growth of data, storage systems face enormous pressure from redundant data arising from duplicate copies or duplicate regions of files. Data deduplication is a storage-optimization technique that reduces the data footprint by eliminating redundant copies and storing only unique data. It rests on duplicate data detection, which divides files into parts, compares corresponding parts across files via hashing, and identifies redundant data. This paper proposes SBBS, an efficient sliding blocking algorithm with backtracking sub-blocks for duplicate data detection. SBBS improves on the detection precision of the traditional sliding blocking (SB) algorithm by backtracking the left/right 1/4 and 1/2 sub-blocks in matching-failed segments. Experimental results show that SBBS improves duplicate detection precision by 6.5% on average over traditional SB and by 16.5% over the content-defined chunking (CDC) algorithm, while adding little extra storage overhead when files are divided into equal chunks of 8 KB.
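The scheme sketched in the abstract can be illustrated with a minimal example: a traditional sliding-block pass, plus a simplified SBBS-style backtracking step that re-checks the left/right 1/2 and 1/4 sub-blocks of segments where no full block matched. This is a sketch under assumptions, not the paper's implementation: SHA-1 stands in for whatever hash scheme the authors use, block sizes are tiny for demonstration (the paper uses 8 KB chunks), and the real algorithm's bookkeeping for overlapping sub-block matches is more careful.

```python
import hashlib

def block_hashes(data, size):
    """Hash every aligned `size`-byte chunk of `data`."""
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data) - size + 1, size)}

def sliding_block_duplicates(old, new, size):
    """Traditional SB: slide a window over `new` one byte at a time and
    count bytes whose window hash matches an aligned block of `old`."""
    index = block_hashes(old, size)
    dup = i = 0
    while i + size <= len(new):
        if hashlib.sha1(new[i:i + size]).hexdigest() in index:
            dup += size
            i += size          # jump past the matched block
        else:
            i += 1             # slide by one byte
    return dup

def sbbs_duplicates(old, new, size):
    """SBBS-style sketch: like SB, but also index 1/2- and 1/4-size blocks
    of `old` and re-check the ends of matching-failed segments."""
    idx = {s: block_hashes(old, s) for s in (size, size // 2, size // 4)}

    def backtrack(seg):
        # Try the left and right sub-blocks of a failed segment,
        # preferring the larger (1/2) sub-block on each side.
        found = 0
        for take in (lambda s: seg[:s], lambda s: seg[-s:]):
            for s in (size // 2, size // 4):
                if len(seg) >= s and hashlib.sha1(take(s)).hexdigest() in idx[s]:
                    found += s
                    break
        return found

    dup = i = seg_start = 0
    while i + size <= len(new):
        if hashlib.sha1(new[i:i + size]).hexdigest() in idx[size]:
            dup += size + backtrack(new[seg_start:i])
            i += size
            seg_start = i
        else:
            i += 1
    return dup + backtrack(new[seg_start:])
```

For example, if a new file contains only the first half-block of an old block followed by fresh data, the plain SB pass finds no duplicate at all, while the backtracking pass still recovers the matching half-block, which is the precision gain the abstract reports.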