Real-time approximate Range Motif discovery & data redundancy removal algorithm

  • Authors:
  • Ankur Narang;Souvik Bhattcherjee

  • Affiliations:
  • IBM Research India, New Delhi, India;IBM Research - India, New Delhi, India

  • Venue:
  • Proceedings of the 14th International Conference on Extending Database Technology
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Removing redundancy in the data is an important problem as it helps in resource and compute efficiency for downstream processing of massive (10 million to 100 million records) datasets. In application domains such as IR, stock markets, telecom and others there is a strong need for real-time data redundancy removal of enormous amounts of data flowing at the rate of 1Gb/s or higher. We consider the problem of finding Range Motifs (clusters) over records in a large dataset such that records within the same cluster are approximately close to each other. This problem is closely related to the approximate nearest neighbour search but is more computationally expensive. Real-time scalable approximate Range Motif discovery on massive datasets is a challenging problem. We present the design of novel sequential and parallel approximate Range Motif discovery and data de-duplication algorithms using Bloom filters. We establish asymptotic upper bounds on the false positive and false negative rates for our algorithm. Further, time complexity analysis of our parallel algorithm on multi-core architectures has been presented. For 10 million records, our parallel algorithm can perform approximate Range Motif discovery and data de-duplication, on 4 sets (clusters), in 59s, on 16 core Intel Xeon 5570 architecture. This gives a throughput of around 170K records/s and around 700Mb/s (using records of size 4K bits). To the best of our knowledge, this is the highest real-time throughput for approximate Range Motif discovery and data redundancy removal on such massive datasets.