Reduct and variance based clustering of high dimensional dataset

Authors:
Dharmveer Singh Rajput;P. K. Singh;M. Bhattacharya
Affiliations:
ABV-Indian Institute of Information Technology and Management, Gwalior, Madhya Pradesh, India;ABV-Indian Institute of Information Technology and Management, Gwalior, Madhya Pradesh, India;ABV-Indian Institute of Information Technology and Management, Gwalior, Madhya Pradesh, India
Venue:
ICDEM'10 Proceedings of the Second international conference on Data Engineering and Management
Year:
2010

Abstract

In high dimensional data, the performance of traditional clustering algorithms degrades: some dimensions are likely to be irrelevant or to contain noisy data, and randomly selected initial cluster centres can cause the clustering to converge to local minima. In this paper, we propose a framework for clustering high dimensional data with attribute subset selection and efficient cluster centre initialization. In the first phase, it uses rough set theory to determine the relevant attributes (dimensions). In the second phase, the dimension of maximum variance is used to determine the initial centres of the clusters. In the third phase, the k-means clustering algorithm is applied with these initial cluster centres to cluster the data set. The framework improves the efficiency of the clustering process, and our experiment on a test data set shows that the accuracy of the results improves considerably.
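The second and third phases can be illustrated with a minimal sketch. The abstract does not say how the maximum variance dimension yields the initial centres, so the code below assumes one common scheme: sort the points along the highest-variance dimension, cut them into k nearly equal slices, and take each slice's mean as a centre. The rough set phase is not reproduced; its output is stood in for by a hypothetical index list `relevant`. The function names `max_variance_init` and `kmeans` and the toy data are likewise illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pvariance


def max_variance_init(points, k):
    """Assumed scheme: k centres from slices along the highest-variance dimension.

    Requires len(points) >= k so that every slice is non-empty.
    """
    dims = len(points[0])
    # Index of the dimension whose values have the largest variance.
    d = max(range(dims), key=lambda j: pvariance([p[j] for p in points]))
    ordered = sorted(points, key=lambda p: p[d])
    n = len(ordered)
    centres = []
    for i in range(k):
        # i-th of k nearly equal slices of the sorted data.
        chunk = ordered[i * n // k:(i + 1) * n // k]
        centres.append([mean(col) for col in zip(*chunk)])
    return centres


def kmeans(points, centres, max_iter=100):
    """Standard Lloyd iterations starting from the supplied centres."""
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centre (squared Euclidean distance).
        labels = [
            min(range(len(centres)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        new_centres = []
        for c in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the previous centre if a cluster ends up empty.
            new_centres.append(
                [mean(col) for col in zip(*members)] if members else centres[c]
            )
        if new_centres == centres:
            break  # assignments are stable
        centres = new_centres
    return labels, centres


# Toy data: the third attribute is treated as irrelevant.
points = [[1.0, 5.0, 0.1], [1.2, 5.1, 9.0], [8.0, 5.0, 0.2],
          [8.3, 4.9, 9.1], [1.1, 5.2, 0.3], [7.9, 5.1, 8.8]]
relevant = [0, 1]  # hypothetical result of the rough set phase
reduced = [[p[j] for j in relevant] for p in points]
labels, centres = kmeans(reduced, max_variance_init(reduced, 2))
```

Because the initial centres are computed from the data rather than drawn at random, repeated runs on the same input give the same clustering, which is the practical appeal of this kind of initialization.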