Tolerance rough set theory based data summarization for clustering large datasets

Authors:
Bidyut Kr. Patra; Sukumar Nandi
Affiliations:
Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India; Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, India
Venue:
Transactions on Rough Sets XIV
Year:
2011

Abstract

Finding clusters in large datasets is a challenge in many fields of science and technology. Many clustering methods have been developed over the years, but most of them require multiple scans of the data to converge, which makes them impractical for cluster analysis of large datasets. Data summarization can be used as a pre-processing step to speed up classical clustering methods on such datasets. In this paper, we propose a data summarization scheme based on tolerance rough set theory, termed rough bubble. Rough bubble uses the leaders clustering method to collect sufficient statistics of the dataset, which can then be used to cluster it. We show that the proposed scheme outperforms the recently introduced data bubble summarization scheme when the agglomerative hierarchical (single-link) clustering method is applied to the summarized data. We also introduce a technique to reduce the number of distance computations required by the leaders clustering method. Experiments with synthetic and real-world datasets show the effectiveness of our methods for large datasets.
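
For readers unfamiliar with the leaders clustering method mentioned in the abstract, the following is a minimal sketch of the generic single-scan procedure, not the authors' rough bubble implementation. The function name `leaders_summarize`, the threshold name `tau`, and the choice of per-leader statistics (a member count and a per-dimension linear sum) are assumptions made for illustration; the abstract does not specify which statistics rough bubble stores, nor how the tolerance rough set model enters.

```python
import math


def leaders_summarize(points, tau):
    """Generic leaders clustering in one scan over the data.

    Each point joins the first existing leader within distance tau;
    if none qualifies, the point becomes a new leader.
    The statistics kept per leader (count, linear_sum) are illustrative
    assumptions, not the rough bubble statistics from the paper.
    """
    summary = []  # one entry per leader, in order of creation
    for p in points:
        for entry in summary:
            if math.dist(p, entry["leader"]) <= tau:
                # Absorb the point into this leader's statistics.
                entry["count"] += 1
                entry["linear_sum"] = [
                    s + x for s, x in zip(entry["linear_sum"], p)
                ]
                break
        else:
            # No leader is close enough: the point starts a new group.
            summary.append(
                {"leader": tuple(p), "count": 1, "linear_sum": list(p)}
            )
    return summary


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1), (0.1, 0.2)]
    for entry in leaders_summarize(data, tau=1.0):
        print(entry["leader"], entry["count"], entry["linear_sum"])
```

The result depends on the order in which points are scanned, and each point is compared against leaders until a match is found, which is why reducing the number of distance computations matters for large datasets. A hierarchical method such as single-link can then be run on the much smaller set of leaders instead of the full dataset.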