A global unsupervised data discretization algorithm based on collective correlation coefficient

  • Authors:
  • An Zeng;Qi-Gang Gao;Dan Pan

  • Affiliations:
  • Guangdong University of Technology and Dalhousie University;Dalhousie University;Saint Mary's University

  • Venue:
  • IEA/AIE'11: Proceedings of the 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems: Modern Approaches in Applied Intelligence, Volume Part I
  • Year:
  • 2011

Abstract

Data discretization is an important task for certain types of data mining algorithms, such as association rule discovery and Bayesian learning. For those algorithms, proper discretization can not only significantly improve the quality and understandability of the discovered knowledge but also reduce running time. We present a Global Unsupervised Discretization Algorithm based on the Collective Correlation Coefficient (GUDA-CCC), which has two main merits: 1) it does not require class labels from the training data; 2) it preserves the rank order of attribute importance in a data set while minimizing the information loss, measured by mean square error. Attribute importance is quantified by the CCC, which is derived from principal component analysis (PCA). The idea behind GUDA-CCC is that staying as close as possible to the original data set may be the best policy, especially when other available information is not reliable enough to be leveraged in the discretization. Experiments on benchmark data sets illustrate the effectiveness of the GUDA-CCC algorithm.
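
The following is a minimal sketch of only one ingredient mentioned in the abstract: unsupervised discretization of a single attribute so that the mean square error between the original values and their bin representatives is small. It is not the GUDA-CCC algorithm. The abstract does not define the CCC or the rank-preservation step, so both are omitted here, and a plain Lloyd-style (one-dimensional k-means) procedure is swapped in as the quantizer. The function names `discretize_min_mse` and `mse`, the quantile initialization, and the sample values are all hypothetical choices made for illustration.

```python
from statistics import fmean


def discretize_min_mse(values, n_bins, n_iter=100):
    """Lloyd-style 1-D quantization (assumed stand-in, not the paper's method).

    Returns (bin index for each value, list of bin centers).
    Requires 1 <= n_bins <= len(values). Finds a local minimum of the MSE.
    """
    def assign(centers):
        # Each value goes to the bin whose center is nearest.
        return [min(range(n_bins), key=lambda j: (v - centers[j]) ** 2)
                for v in values]

    ordered = sorted(values)
    n = len(ordered)
    # Hypothetical initialization: centers at evenly spaced quantiles.
    centers = [ordered[(2 * i + 1) * n // (2 * n_bins)] for i in range(n_bins)]

    for _ in range(n_iter):
        labels = assign(centers)
        # Move each center to the mean of its members; keep empty bins as-is.
        new_centers = []
        for j in range(n_bins):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(fmean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers

    return assign(centers), centers


def mse(values, labels, centers):
    """Information loss: mean squared gap between values and bin centers."""
    return fmean((v - centers[lab]) ** 2 for v, lab in zip(values, labels))


# Made-up sample attribute with three visible clusters.
sample = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.9, 10.2, 10.0]
labels3, centers3 = discretize_min_mse(sample, 3)
labels2, centers2 = discretize_min_mse(sample, 2)
loss3 = mse(sample, labels3, centers3)
loss2 = mse(sample, labels2, centers2)
```

On this sample, three bins recover the three clusters and give a much lower loss than two bins. A global method in the sense of the abstract would additionally have to choose the bins of all attributes jointly so that the importance ranking of the attributes is unchanged, which this per-attribute sketch does not attempt.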