Tolerance rough set theory based data summarization for clustering large datasets

  • Authors:
  • Bidyut Kr. Patra;Sukumar Nandi

  • Affiliations:
  • Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India;Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India

  • Venue:
  • Transactions on Rough Sets XIV
  • Year:
  • 2011

Abstract

Finding clusters in large datasets is an interesting challenge in many fields of science and technology. Many clustering methods have been developed successfully over the years. However, most existing clustering methods require multiple scans of the data to converge, so they cannot be applied to cluster analysis of large datasets. Data summarization can be used as a pre-processing step to speed up classical clustering methods on large datasets. In this paper, we propose a data summarization scheme based on tolerance rough set theory, termed rough bubble. Rough bubble uses the leaders clustering method to collect sufficient statistics of the dataset, which can then be used to cluster the dataset. We show that the proposed summarization scheme outperforms the recently introduced data bubble summarization scheme when the agglomerative hierarchical (single-link) clustering method is applied to it. We also introduce a technique to reduce the number of distance computations required by the leaders clustering method. Experiments conducted with synthetic and real-world datasets show the effectiveness of our methods for large datasets.
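
To illustrate the kind of single-scan summarization the abstract refers to, the following is a minimal Python sketch of the classical leaders clustering method, extended to record simple per-leader statistics (count, linear sum, sum of squares). It is not the rough bubble scheme itself, whose tolerance rough set based definition is not given in this abstract. The function name `leaders_summarize`, the threshold parameter `tau`, and the choice of statistics are assumptions made for illustration only.

```python
import math


def leaders_summarize(points, tau):
    """Scan the data once and summarize it around "leader" points.

    points: iterable of equal-length numeric tuples.
    tau: distance threshold (hypothetical name); a point joins the first
         leader found within distance tau, otherwise it becomes a new leader.
    """
    summary = []  # one dict of statistics per leader, in discovery order
    for p in points:
        for entry in summary:
            if math.dist(p, entry["leader"]) <= tau:
                # Update the statistics of the first leader that is close enough.
                entry["n"] += 1
                entry["linear_sum"] = [s + x for s, x in zip(entry["linear_sum"], p)]
                entry["square_sum"] += sum(x * x for x in p)
                break
        else:
            # No existing leader is within tau, so p starts a new group.
            summary.append({
                "leader": tuple(p),
                "n": 1,
                "linear_sum": list(p),
                "square_sum": sum(x * x for x in p),
            })
    return summary


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.0, 5.5), (0.0, 0.4)]
    for entry in leaders_summarize(data, tau=1.0):
        print(entry["leader"], entry["n"])
```

Each point is compared only against the current leaders, so the dataset is read once. A clustering method such as single-link could then, under these assumptions, operate on the much smaller set of leaders and their statistics rather than on the full dataset.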