Min-wise independent permutations
Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Avoiding the disk bottleneck in the data domain deduplication file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
HYDRAstor: a Scalable Secondary Storage
FAST '09 Proccedings of the 7th conference on File and storage technologies
Cumulus: filesystem backup to the cloud
FAST '09 Proccedings of the 7th conference on File and storage technologies
Tradeoffs in scalable data routing for deduplication clusters
FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
CLUSTER '11 Proceedings of the 2011 IEEE International Conference on Cluster Computing
File routing middleware for cloud deduplication
Proceedings of the 2nd International Workshop on Cloud Computing Platforms
Characteristics of backup workloads in production systems
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Content-aware load balancing for distributed backup
LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
Hi-index | 0.00 |
Cluster deduplication has become a widely deployed technology in data protection services for Big Data to satisfy the requirements of service level agreement (SLA). However, it remains a great challenge for cluster deduplication to strike a sensible tradeoff between the conflicting goals of scalable deduplication throughput and high duplicate elimination ratio in cluster systems with low-end individual secondary storage nodes. We propose Σ-Dedupe, a scalable inline cluster deduplication framework, as a middleware deployable in cloud data centers, to meet this challenge by exploiting data similarity and locality to optimize cluster deduplication in inter-node and intra-node scenarios, respectively. Governed by a similarity-based stateful data routing scheme, Σ-Dedupe assigns similar data to the same backup server at the super-chunk granularity using a handprinting technique to maintain high cluster-deduplication efficiency without cross-node deduplication, and balances the workload of servers from backup clients. Meanwhile, Σ-Dedupe builds a similarity index over the traditional locality-preserved caching design to alleviate the chunk index-lookup bottleneck in each node. Extensive evaluation of our Σ-Dedupe prototype against state-of-the-art schemes, driven by real-world datasets, demonstrates that Σ-Dedupe achieves a cluster-wide duplicate elimination ratio almost as high as the high-overhead and poorly scalable traditional stateful routing scheme but at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio stateless routing approaches.