Incremental shared nearest neighbor density-based clustering

Authors:
Sumeet Singh;Amit Awekar
Affiliations:
Indian Institute of Technology, Guwahati, India;Indian Institute of Technology, Guwahati, India
Venue:
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Year:
2013

Citing 3
Cited 0

Incremental Clustering for Mining in a Data Warehousing Environment

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Clustering Using a Similarity Measure Based on Shared Near Neighbors

IEEE Transactions on Computers
Survey of clustering algorithms

IEEE Transactions on Neural Networks

Quantified Score

Hi-index	0.00

Visualization

Abstract

Shared Nearest Neighbor Density-based clustering (SNN-DBSCAN) is a robust graph-based clustering algorithm and has wide applications from climate data analysis to network intrusion detection. We propose an incremental extension to this algorithm IncSNN-DBSCAN, capable of finding clusters on a dataset to which frequent inserts are made. For each data point, the algorithm maintains four properties: nearest neighbor list, strengths of shared links, total connection strength and topic property. Algorithm only targets points that undergo change to their properties. We prove that, to obtain the exact clustering it is sufficient to re-compute properties for only the targeted points, followed by possible cluster mergers on newly formed links and cluster splits on the deleted links. Experiments on KDD Cup 1999 and Mopsi search engine 2012 datasets respectively demonstrate 75% and 99% reduction in the size of the set of points involved in property re-computations. By avoiding most of the redundant property computations, algorithm generates speedup up to 250 and 1000 times respectively on those datasets, while generating the exact same clustering as the non-incremental algorithm. We experimentally verify our claim for up to 2500 inserts on both datasets. However, speedup comes at the cost of up to 48 times more memory usage.