Deduplication technology is maturing and becoming a standard feature of most storage architectures. Many approaches have been proposed to address the deduplication scalability challenges of privately owned storage infrastructure, but as storage moves to the cloud, deduplication mechanisms are expected to scale to thousands of storage nodes. Currently available solutions were not designed for such scale, and research and practical experience suggest that aiming for global deduplication among thousands of nodes will almost certainly lead to high complexity, reduced performance, and reduced reliability. Instead, we propose performing local deduplication within each cloud node, and we introduce file similarity metrics to determine which node is the best deduplication host for a particular incoming file. This approach reduces the problem of scalable cloud deduplication to a file routing problem, which we can address using a software layer capable of making the necessary routing decisions. Using the proposed file routing middleware layer, the system can achieve three important properties: it scales to thousands of nodes, supports almost any type of underlying storage node, and makes the most of file-level deduplication.
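To make the routing idea concrete, the following is a minimal sketch of similarity-based file routing. It is an illustration under stated assumptions, not the metrics or middleware proposed here: the abstract does not specify the similarity measure, so the sketch assumes fixed-size chunking, a signature made of the smallest chunk fingerprints, and a tie-break that favors the least-loaded node. The names `file_sketch`, `Node`, `similarity`, and `route` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4096  # assumption: fixed-size chunks
SKETCH_SIZE = 8    # assumption: number of chunk fingerprints kept per file


def file_sketch(data: bytes, chunk_size: int = CHUNK_SIZE,
                k: int = SKETCH_SIZE) -> frozenset:
    """Return the k smallest distinct chunk fingerprints as a file signature."""
    fingerprints = {
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }
    return frozenset(sorted(fingerprints)[:k])


@dataclass
class Node:
    """Hypothetical storage node that deduplicates locally."""
    name: str
    known: set = field(default_factory=set)  # signature fingerprints seen so far
    stored_files: int = 0


def similarity(sketch: frozenset, node: Node) -> float:
    """Fraction of the file's signature already present on the node."""
    if not sketch:
        return 0.0
    return len(sketch & node.known) / len(sketch)


def route(data: bytes, nodes: list) -> Node:
    """Send the file to the most similar node; on ties, to the least loaded."""
    sketch = file_sketch(data)
    best = max(nodes, key=lambda n: (similarity(sketch, n), -n.stored_files))
    best.known |= sketch
    best.stored_files += 1
    return best


if __name__ == "__main__":
    nodes = [Node("a"), Node("b")]
    for payload in (b"A" * 8192 + b"B" * 4096,
                    b"C" * 4096 + b"D" * 4096,
                    b"A" * 4096 + b"E" * 4096):
        print(route(payload, nodes).name)
```

In this sketch the routing layer keeps only a small signature per file, so the routing decision never requires a global chunk index; each node remains free to deduplicate locally in whatever way its storage backend supports.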