Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Summary cache: a scalable wide-area web cache sharing protocol
IEEE/ACM Transactions on Networking (TON)
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
Scalable packet classification
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
IEEE/ACM Transactions on Networking (TON)
Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
Longest prefix matching using Bloom filters
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Farsite: federated, available, and reliable storage for an incompletely trusted environment
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Fast hash table lookup using extended Bloom filter: an aid to network processing
Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications
Queueing Systems, Volume 1: Theory
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
TAPER: tiered approach for eliminating redundancy in replica synchronization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Avoiding the disk bottleneck in the data domain deduplication file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Less hashing, same performance: Building a better Bloom filter
Random Structures & Algorithms
Optimizing Distributed Joins with Bloom Filters
ICDCIT '08 Proceedings of the 5th International Conference on Distributed Computing and Internet Technology
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proceedings of the 7th conference on File and storage technologies
Cache-, hash-, and space-efficient Bloom filters
Journal of Experimental Algorithmics (JEA)
A multi-attribute data structure with parallel Bloom filters for network services
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Real-time approximate Range Motif discovery & data redundancy removal algorithm
Proceedings of the 14th International Conference on Extending Database Technology
ICHIT'11 Proceedings of the 5th international conference on Convergence and hybrid information technology
TBF: a high-efficient query mechanism in de-duplication backup system
GPC'12 Proceedings of the 7th international conference on Advances in Grid and Pervasive Computing
The ever-growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, web pages, stock markets, medical records, and other domains has triggered worldwide research in data-intensive computing. A key requirement is removing redundancy from the data, as this improves the efficiency of downstream processing. These application domains need high-throughput data deduplication for huge volumes of data flowing at 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm, along with a queueing-theoretic analysis to optimize its throughput on multi-core architectures. For 500M records, our parallel algorithm performs complete deduplication in 255 s on a 16-core Intel Xeon 5570 architecture, a throughput of around 2M records/s. For 2048-byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest reported throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm on both multi-core Power6 and Intel Xeon 5570 architectures.
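The paper's exact algorithm is not reproduced on this page, so the following is only a minimal sketch of the general technique the abstract describes: hash-partitioning records across workers so that identical records always land in the same shard and each worker can deduplicate its shard independently, with no shared state. The sharding function, SHA-1 digests, and 16-worker default are illustrative assumptions, not the authors' design (which also involves Bloom filters and queueing-theoretic tuning of the pipeline).

    import hashlib
    from concurrent.futures import ProcessPoolExecutor


    def shard_of(record: bytes, num_shards: int) -> int:
        # Route a record to a shard by its hash; duplicates always collide
        # in the same shard, so workers need no cross-shard coordination.
        digest = hashlib.sha1(record).digest()
        return int.from_bytes(digest[:4], "big") % num_shards


    def dedup_shard(records: list) -> list:
        # Sequentially deduplicate one shard via an in-memory digest set.
        seen = set()
        unique = []
        for rec in records:
            d = hashlib.sha1(rec).digest()
            if d not in seen:
                seen.add(d)
                unique.append(rec)
        return unique


    def parallel_dedup(records, num_workers: int = 16) -> list:
        # Partition phase: hash-split the input so duplicates co-locate.
        shards = [[] for _ in range(num_workers)]
        for rec in records:
            shards[shard_of(rec, num_workers)].append(rec)
        # Dedup phase: each worker scans its shard independently.
        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            results = pool.map(dedup_shard, shards)
        return [rec for shard in results for rec in shard]


    if __name__ == "__main__":
        # 10,000 records with only 1,000 distinct payloads.
        data = [b"record-%d" % (i % 1000) for i in range(10_000)]
        print(len(parallel_dedup(data)))  # -> 1000

The point of the hash partition is that it eliminates inter-worker synchronization, which is usually what limits scaling in such pipelines; a per-shard Bloom filter (a hypothetical extension here, though consistent with the works cited above) could further serve as a cheap membership pre-filter before exact digest comparison.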