High throughput data redundancy removal algorithm with scalable performance

  • Authors:
  • Souvik Bhattacherjee; Ankur Narang; Vikas K. Garg

  • Affiliations:
  • IBM Research, India (all authors)

  • Venue:
  • Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
  • Year:
  • 2011

Abstract

The ever-growing need to process and analyze massive amounts of data from diverse sources, such as telecom call data records, telescope imagery, web pages, stock market feeds, and medical records, has triggered worldwide research in data-intensive computing. A key requirement is removing redundancy from the data, since this improves the efficiency of downstream processing. These application domains need high-throughput deduplication of huge data volumes flowing at rates of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm, along with a queueing-theoretic analysis that optimizes its throughput on multi-core architectures. For 500M records, our parallel algorithm performs complete deduplication in 255 s on a 16-core Intel Xeon 5570, a throughput of around 2M records/s; for 2048-byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput reported for data redundancy removal on datasets of this scale. We also demonstrate strong and weak scalability of our algorithm on both multi-core Power6 and Intel Xeon 5570 architectures.
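The abstract does not spell out the algorithm's internals. A common way to realize parallel deduplication on multi-core machines is to hash-partition records across workers so that all copies of a record land in the same partition, letting each worker deduplicate its shard without locks. The sketch below is a minimal, hypothetical illustration of that general pattern, not the authors' implementation; the function name `parallel_dedup` and all parameters are assumptions.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of hash-partitioned parallel deduplication.
// A record's hash fully determines its shard, so duplicates always
// meet in the same shard and workers never need to synchronize.
std::vector<std::string> parallel_dedup(const std::vector<std::string>& records,
                                        unsigned num_workers) {
    std::hash<std::string> hasher;

    // Phase 1: route each record to the shard owned by (hash % num_workers).
    std::vector<std::vector<const std::string*>> shards(num_workers);
    for (const auto& r : records)
        shards[hasher(r) % num_workers].push_back(&r);

    // Phase 2: each worker removes duplicates within its own shard.
    std::vector<std::vector<std::string>> unique_per_shard(num_workers);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            std::unordered_set<std::string> seen;
            for (const std::string* r : shards[w])
                if (seen.insert(*r).second)        // keep first occurrence only
                    unique_per_shard[w].push_back(*r);
        });
    }
    for (auto& t : workers) t.join();

    // Phase 3: concatenate the disjoint shard outputs.
    std::vector<std::string> result;
    for (auto& shard : unique_per_shard)
        for (auto& r : shard) result.push_back(std::move(r));
    return result;
}

int main() {
    std::vector<std::string> records = {"a", "b", "a", "c", "b", "d"};
    auto unique = parallel_dedup(records, 4);
    std::cout << unique.size() << " unique records\n";  // prints: 4 unique records
}
```

Because the hash alone determines a record's shard, the membership test in phase 2 needs no inter-thread coordination, which is what lets throughput grow with core count, the property the paper's queueing-theoretic analysis is designed to optimize.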