Scalable, highly reliable distributed systems that support data deduplication have recently become popular for storing backup and archival data. One important requirement for backup storage is the ability to delete data selectively. Unlike in traditional storage systems, data deletion in distributed systems with deduplication is a major challenge, because deduplication gives each data chunk multiple owners. Moreover, the system configuration changes often because of node additions, removals, and failures. The expectation of high performance, high availability, and low impact of deletion on regular user operations further complicates the identification and reclamation of unnecessary blocks. This paper describes a deletion algorithm for a scalable, content-addressable storage system with global deduplication. The deletion is concurrent: user reads and writes can proceed in parallel with deletion, subject only to minor restrictions that make reclamation feasible. Moreover, our approach allows user writes to be deduplicated during deletion. We extend traditional distributed reference counting to deliver a failure-tolerant deletion that accommodates not only deduplication but also the dynamic nature of a scalable system and its physical resource constraints. The proposed algorithm has been verified with an implementation in a commercial deduplicating storage system. The impact of deletion on user operations is configurable. With the default setting, which grants deletion at most 30% of system resources, running the deletion reduces end performance by no more than 30%. This impact can be reduced to less than 5% when deletion is given only minimal resources.
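To make the "multiple owners" problem concrete, the following is a minimal single-node sketch of plain reference counting in a content-addressable store. It is not the paper's algorithm, which is distributed, concurrent, and failure-tolerant; the class name ToyChunkStore, its methods, and the use of SHA-256 as the content address are assumptions made purely for illustration.

```python
import hashlib


class ToyChunkStore:
    """Hypothetical in-memory content-addressable store with reference counts."""

    def __init__(self):
        self.chunks = {}    # content address -> chunk bytes
        self.refcount = {}  # content address -> number of owners

    def write(self, data: bytes) -> str:
        # The address is derived from the content, so identical data
        # always maps to the same address (assumed: SHA-256).
        address = hashlib.sha256(data).hexdigest()
        if address not in self.chunks:
            self.chunks[address] = data
            self.refcount[address] = 0
        # A duplicate write stores nothing new; it only adds an owner.
        self.refcount[address] += 1
        return address

    def delete(self, address: str) -> bool:
        """Drop one owner; return True if the chunk was reclaimed."""
        self.refcount[address] -= 1
        if self.refcount[address] == 0:
            # No owners remain, so the space can be reclaimed.
            del self.chunks[address]
            del self.refcount[address]
            return True
        return False


store = ToyChunkStore()
a = store.write(b"backup block")
b = store.write(b"backup block")  # deduplicated: same address, second owner
first = store.delete(a)           # one owner still remains
second = store.delete(b)          # last owner gone, chunk reclaimed
```

In this toy version the first delete cannot reclaim the chunk because a second owner still references it. The sketch ignores everything that makes the real problem hard: counters spread across nodes, writes racing with deletion, and node failures during counter updates.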